1. PURPOSE

1.1  SCOPE 

This report discusses the work performed for the U. S. Army Electronics
Research and Development Laboratory under Contract No. DA 36-039-SC-90787
during the period from 1 October 1962 to 31 December 1962.

1.2 OBJECTIVES

The  objective  of  this  project  is  to  investigate  the  techniques  and 
concepts  of  information  retrieval  and  to  formulate  and  develop  a  general 
theory  of  information  retrieval.  The  formalization  of  this  theory  is 
oriented to the automation of large-capacity information storage and
retrieval systems. This theoretical framework will be the basis for the
utilization of general purpose stored-program digital computer systems
for performing the storage and retrieval functions.

1.3  PROJECT  TASKS 

During the first quarter of this project a preliminary model of the
information storage and retrieval problem was developed as a frame of
reference for subsequent analysis. This quarter was spent in more detailed
investigations of significant aspects of the problem as related to the
transformational functions of the model.

In the analysis of any complex problem there are essentially three
levels of understanding to master: the whole, the parts, and the relation
of the parts to the whole. The preliminary model constitutes the whole;
the transformation functions comprise the parts. However, there are a
number of alternative approaches that may be considered for each part or
function. These approaches become the specific tasks or subtasks of the
project.

At this stage many ramifications of the transformational functions
have been analyzed. Although these studies pertain to manifest tasks,
they have not been formally designated as such. The process of
formalization depends upon a review of the relation of each part to the
central problems of the whole. Specific tasks will be assigned during the
next quarter, and subsequent reports will be oriented to the activity
performed under these designated tasks.

This discussion does not vitiate the statement in this section of the
First Quarterly Report. In that report three tasks were defined, but
these tasks pertain to methodology rather than functional requirements.
At this stage of the project it is essential to shift from a methodological
to a functional viewpoint.




2.  ABSTRACT 


This report discusses research activity performed in the investigation
of the techniques and concepts of information retrieval. The general
problems of information storage and retrieval are reviewed to establish a
framework for the development of general theoretical principles. Several
functional characteristics of the preliminary model (the representation of
file items, file organization, system design and synthesis, and relevance)
are summarized in terms of tentative solutions and their attendant
difficulties. Specific aspects of the problem (information theoretical
methods of document categorization and corrective procedures for automatic
indexing) are examined in detail.
3. PUBLICATIONS, REPORTS, AND CONFERENCES

3.1  TECHNICAL  NOTES 

The following internal technical memoranda were issued during this
reporting period:

(a) IEC TECHNICAL NOTE, File No. P-AA-TN-(0043)-N, 18 December 1962;
A Measure of Effectiveness for Document Retrieval Systems,
Quentin A. Darmstadt.

(b) IEC TECHNICAL NOTE, File No. P-AA-TN-(0044)-N, 20 December 1962;
Corrective Procedures for Automatic Indexing Systems, Alexander
Szejman.

(c) IEC TECHNICAL NOTE, File No. P-AA-TN-(0045)-N, 2? December 1962;
Information Theoretical Methods of Document Categorization,
Alfred Trachtenberg.

(d) IEC TECHNICAL NOTE, File No. P-AA-TN-(0046)-N, 31 December 1962;
Survey of Mathematical Models of Various Aspects of Information
Retrieval, Quentin A. Darmstadt.

These technical notes are dated at the time of their completion; these
dates do not necessarily correspond to the date of publication.

3.2  REPORTS 

The  following  reports  were  issued  during  this  reporting  period: 

(a) RESEARCH IN INFORMATION RETRIEVAL; First Quarterly Report,
1 July 1962 - 30 September 1962, Technical Report P-AA-TR-(0010),
(Manuscript Version), 30 October 1962.

(b) MONTHLY LETTER REPORT NO. 3, 1 October 1962 - 31 October 1962,
File No. P-AA-TR-(0012), 31 October 1962; Research in Information
Retrieval. Alfred Trachtenberg.

(c) MONTHLY LETTER REPORT NO. 4, 1 November 1962 - 30 November 1962,
File No. P-AA-TR-(0025), 30 November 1962; Research in Information
Retrieval. Alfred Trachtenberg.
3.3 CONFERENCES


The following conferences were held between IEC and USAELRDL personnel:

(a) 29 November 1962 - Meeting at IEC. IEC personnel were introduced
to Mr. Anthony V. Campi, who had recently been assigned as
Project Engineer. Several aspects of the First Quarterly Report
were discussed, and the concepts pertaining to measure of
relevance were clarified. IEC accepted the suggestion that the
discussion in the report should be elaborated in more detail.

Mr. Quentin A. Darmstadt attended the conference entitled "Mathematics
of Information Storage and Retrieval," which was conducted by Dr. Robert M.
Hayes under the auspices of the Georgia Institute of Technology from 3 to
7 December 1962. The relevance of the conference to this project is
evident in the title. However, because of the general significance of the
conference, attendance was sponsored by IEC.
4. FACTUAL DATA


4.1 STATEMENT OF THE PROBLEM

The technical requirement for this project, as stated in SCL-4355,
specifies "...a research investigation of techniques and concepts
necessary for the efficient mechanization of large-capacity information
storage and retrieval systems." The future applied objectives suggested as
guides for such research constitute a range of "...problems of military
significance; i.e., personnel files, intelligence data, etc."

The problem as presently conceived is to develop a general theory of
information retrieval whose primary goal is its use as a system tool for
the optimum design of specific information retrieval systems in the future.
The project is oriented to a theory of systems that can be applied to the
design of specific job-oriented systems in their entirety rather than to
a specific procedure(s); to dealing with real contexts that may be of
interest to the Army, wherever possible, rather than necessarily limiting
the study to abstract formalism; to the consideration of optimum
hardware once software at the level of algorithm rather than machine code
has been specified; and to the problem of conversion to canonic form
when linguistic complexity is not the critical problem.

A  general  model  of  the  information  retrieval  process  has  been  developed. 
This model provides a framework both for understanding the critical
features of information retrieval systems of different levels of
sophistication and for isolating critical areas of information retrieval
procedures and techniques to focus upon for further development.
4.2 SYSTEM MODEL

4.2.1 Analytic Framework - The information retrieval model developed
in the First Quarterly Report forms the basis for the analytic framework.
This model defines information storage and retrieval formally and abstractly,
although the model is quite simple. The three algorithmic transformations
isolated (D, E, P) do not presuppose any specific form of classificatory
or interrogatory vocabulary, nor do they depend upon any unique search
procedures or file structure. Furthermore, there is no precommitment in
allocating the functions to manual or machine processing.

Ultimately, the model should encompass a completely automated
information content storage and retrieval system. Such a system is infeasible
in the present state-of-the-art of automating human cognitive functions.
Only the processing or P transform for limited document retrieval, with
fairly imprecise but humanly generated indices, is currently being
automated. Even for this limited application the logical file
organization and search procedures as well as their implementation can be
substantially improved.

The study of sophisticated file organization and search procedures
for traditional information retrieval systems will continue to be an
aspect of this program. Even more important, however, will be the
development of file organizations and search procedures for the efficient
implementation of system capabilities that will have to evolve before
fully automated information content storage and retrieval systems can
be developed.
These new capabilities include the automation of functions that can
currently be performed only by people and the development of explicit
transformation algorithms for the model. One of the most difficult areas
for automation is the formalization of ordinary language to describe the
information in a form suitable for efficient storage and effective
retrieval. This problem pertains to the input and query, or D and E
transforms, respectively. The question of linguistic analysis per se
has been deemphasized. However, the more general problem of improving
and automating the D and E transforms is essential to the goals of the
project.

There are a number of relatively discrete capabilities that will have
to be developed, primarily in the input and query transforms. It is
possible to describe several procedurally oriented tasks for producing these
capabilities. Each of the model transforms, D (data input), E (query),
P (processing), and D^-1 (output), will be considered in turn.
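As an informal illustration only, the four transforms can be pictured as four small functions sharing one file. The sketch below is not the model of the First Quarterly Report; the `Store` class, the function names, and the choice of single words as index terms are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    # Hypothetical file: maps each index term to the identifiers filed under it.
    postings: dict = field(default_factory=dict)
    # Original text is kept so the output transform can restore it.
    originals: dict = field(default_factory=dict)


def d_transform(store, item_id, text):
    """D (data input): reduce an item to index terms and file it."""
    store.originals[item_id] = text
    for term in set(text.lower().split()):
        store.postings.setdefault(term, set()).add(item_id)


def e_transform(query):
    """E (query): reduce a request to the same index vocabulary."""
    return set(query.lower().split())


def p_transform(store, terms):
    """P (processing): find the items filed under every query term."""
    matches = [store.postings.get(term, set()) for term in terms]
    return set.intersection(*matches) if matches else set()


def d_inverse(store, item_ids):
    """D^-1 (output): turn stored identifiers back into readable items."""
    return [store.originals[item_id] for item_id in sorted(item_ids)]


store = Store()
d_transform(store, 1, "file organization for retrieval")
d_transform(store, 2, "automatic indexing of documents")
d_transform(store, 3, "automatic file organization")
hits = d_inverse(store, p_transform(store, e_transform("automatic file")))
print(hits)
```

Nothing in the sketch commits the functions to machine rather than manual processing; each could be replaced independently, which is the property the model is meant to preserve.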

4.2.2 The D Transform - The central problem in the transformation
of information inputs to forms usable in storage and retrieval is one of
classifying, categorizing, or indexing. To date all operational
classificatory schemes tend to be intuitively formulated, manually
implemented, and statically evolved; these schemes are virtually impossible
to change systematically.

There are, therefore, three areas in which further capabilities must
be developed:

(a) Explicit procedures for establishing useful category groupings
and boundaries.

(b) Definitive procedures for automatically assigning items or
documents to index categories accurately and efficiently.

(c) Methods for improving the precision of indexing.

The methods for improving precision include adaptive procedures for
altering index assignments to align document categories more closely
with the users' categories as a function of feedback on the adequacy
of individual searches.

These capabilities are in some measure mutually interdependent and
cannot ultimately be developed without reference to other system
transforms. Similarly, the capabilities of other system transforms will
impinge upon the organization of the D transform. Thus the development
of useful and efficient category groupings of descriptors or indices may
be best considered in relation to specific schemes for automatic
document classification. The work of Borko and Bernick [6] illustrates this
approach. Similarly, the validity of adaptive procedures for reorganizing
descriptor assignment is clearly dependent upon the techniques, automatic
or manual, used to assign item categories initially.

It is important to note, however, that these three capabilities are
distinct; work may proceed relatively independently with reasonable
expectation of later integration into a system concept. The work of
Borko and Bernick fails to demonstrate that joint consideration of
automatic category generation and automatic category assignment results in
either an improved category structure or an improved prediction scheme.
Furthermore, attacking these problems as separate capabilities may be
advantageous in allocating effort more efficiently and in developing
more general techniques. Thus work on the problem of finding ideal
categories for grouping items into larger categories may result in
techniques for decomposing larger items into coherent smaller units or
categories. The latter problem is part of the more general problem of
developing explicit procedures for establishing useful categories and
their boundaries, regardless of the level of organization between or
within items that the categories refer to.

This discussion does not imply that work on the explicit generation
of useful categories should necessarily be unconcerned with adequate
automatic prediction of a priori categories. The significant point is
that each task should focus upon the development of as powerful a
capability as possible. If work in one area suggests an approach to any
other, then so much the better.

The formal development of each of these problems is continuing.
Various techniques such as the theory of clumps [21], factor analysis
K], and latent class analysis [1] have been suggested for dealing with
automatic category generation. These techniques are being evaluated
together with the concepts presented in subsequent sections. The
evaluation is essential for the ultimate selection of the most useful
procedure for categorization.

4.2.3 The E Transform - The E transform is the set of algorithms
that transposes the users' queries to the processor. In an ideal system
the E transform would handle any query couched in the natural language
of the user. The present state-of-the-art in information retrieval is
far too primitive to deal with any sophisticated query. Except for
specialized files such as those developed for Baseball [U], ACSIMATIC
[23 or the multi-list system of Prywes and Gray [9, 10], questions of
fact cannot be answered by contemporary information storage and retrieval
systems.

Both Baseball and ACSIMATIC do contribute to the conceptual basis
of the E transform. Baseball analyzes English query sentences, and
ACSIMATIC provides a uniquely articulated query format appropriate to
the intelligence problem. However, both are inappropriate to the general
information storage and retrieval problem in their present status. The
contributions of Prywes and Gray are not pertinent because the problem
they address is primarily in the area of file organization for
attribute-value data. While Prywes and Gray do not contribute to the problem
of the E transform, their work is important relative to the P transform.

These statements are not intended to be derogatory nor to denigrate
the significance of these projects. Therefore, further clarification is
warranted. There are two kinds of fact retrieval:

(a) The retrieval of facts from a table or file specifically
organized by the inventiveness of human programmers for the
retrieval of the summarized facts.

(b) The retrieval of facts or content, the implicit goal of the
preliminary model, from items or documents couched in ordinary
language.

The three cited systems all deal with restricted and specifically
organized data: baseball scores, combat intelligence, and personnel
files. One approach to the direct retrieval of facts from documentary
items is to assume that the problems of the P and E transforms, as
specified for these systems, are essentially solved. Then the only
remaining difficulty is to reduce facts in ordinary language to the
proper tabular or list form.

To adopt this approach, however, is counter-evolutionary. The burden
of development remains in the area of the D transform. In order to
transform informal data automatically into the format required by these
systems, an inordinately long time may pass without any significant
advances toward the goal of automated content retrieval.

At present the only query allowed for documentary data is: "What
documents contain information of the following kind: ______?" This
limitation on queries has many shortcomings. Not all of these shortcomings
must be overcome simultaneously; an evolutionary approach would focus
upon expanding query capability by isolating specific problem areas and
concentrating on them.

There are several important shortcomings or, conversely, desirable
capabilities. The first is a limitation to documents. The query
capability should be extended so that a system could respond with either
large bounded portions of larger documents or with an automatically
generated extract or abstract of the relevant facts in the document.
As these capabilities are developed, a system will approach the goal
of allowing questions of the form: "What information do you have on ...?"
rather than: "What documents ...?"

The second shortcoming pertains to a limitation to all documents
containing relevant information. It is practical not to retrieve
information from or about all documents. If a large number of documents
cover a narrow specialized subject, the relevant information may be scanty,
redundant, or qualitatively poor. In such cases it would be beneficial
to restrict the scope of retrieval or, initially, indexing.

Finally, there is a limitation, in the extreme, on the
characterization of the information intended by the conditional phrase,
"...of the following kind: ______." Different operational information
systems impose different limitations of this type. A hierarchically organized
index or query language may produce such unusual classifications of new
material that a subsidiary index is necessary in order to use the primary
index properly. Freer Uniterm systems are limited to Boolean functions
of two-valued descriptors; the descriptor is either present or absent.
The use of role indicators [22] and similar devices [31] offers some
possibility of improving the query. But the crux of the problem is to
develop a query capability that allows a user to state his question
precisely. This ability is essential to useful content retrieval.
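The restriction to Boolean functions of two-valued descriptors can be made concrete with a small sketch. The file contents, the descriptor names, and the helper functions below are hypothetical and are not taken from any of the systems cited.

```python
# Hypothetical file: each document is represented only by the set of
# descriptors assigned to it, so a descriptor is either present or absent.
FILE = {
    "doc-1": {"radar", "antenna", "design"},
    "doc-2": {"radar", "maintenance"},
    "doc-3": {"antenna", "maintenance"},
}


def has(descriptor):
    """Test that is true when the descriptor is present."""
    return lambda descriptors: descriptor in descriptors


def all_of(*tests):
    """Boolean AND of several tests."""
    return lambda descriptors: all(test(descriptors) for test in tests)


def any_of(*tests):
    """Boolean OR of several tests."""
    return lambda descriptors: any(test(descriptors) for test in tests)


def not_(test):
    """Boolean NOT of a test."""
    return lambda descriptors: not test(descriptors)


def search(file, test):
    """Return the identifiers of all documents whose descriptors pass the test."""
    return sorted(doc for doc, descriptors in file.items() if test(descriptors))


# "radar AND NOT maintenance"
radar_only = search(FILE, all_of(has("radar"), not_(has("maintenance"))))
# "antenna OR maintenance"
either = search(FILE, any_of(has("antenna"), has("maintenance")))
print(radar_only, either)
```

A query of this kind can state only which descriptors are present or absent. It cannot state how two descriptors are related within a document, which is one reason a user may be unable to state his question precisely.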

The three problem areas cannot produce a content retrieval system
if attention is restricted to the E transform. The P transform must
evolve to be able to handle more sophisticated queries. Similarly, the
organization of the D transform must be capable of generating the required
categories and preserving the information for a range of anticipated
queries. Thus work on the categorization aspect of the D transform is
critical if items of smaller scope than an entire document are to be
automatically isolated. Similarly, the methods for improving indexing are
essential to improving the precision of the users' queries.


The interdependence between the D and P transforms does not invalidate
approaching the problems of the E transform. The major problem with some
of the more sophisticated information systems is that so little thought
was given to the query process. The result is systems that are too
cumbersome to use. It is essential that the intentions, requirements, and
capabilities of potential human users be carefully analyzed before the
organization of D and P transforms for future systems is fully established.
For some kinds of information retrieval such as general education and
scholarly research the open stacks and card catalogues of present
libraries suffice. For other information retrieval problems such as keeping
abreast of new developments or resolving specific matters of fact,
innovations are vitally necessary. But such innovations are valueless unless
the system allows the user to ask intelligent and appropriate questions.

4.2.4 The P Transform - Advances in information storage and retrieval
depend upon improved processing algorithms. Unfortunately, advances in the
other transforms will influence the choice of processing techniques. It
is, consequently, difficult to define relatively independent problem areas.

A basic study continues in the analysis of processing requirements
for traditional systems and for new capabilities as they become evident.
Among the subjects that have been analyzed relative to the P transform
are:

(a)  Measures  of  relevance  and  their  processing  applications. 

(b)  Measures  of  efficiency  and  their  optimization. 

(c)  Measures  of  cost  for  both  successes  and  failures. 

(d) Search theory and procedures.

(e) File structure and organization.

(f) System synthesis.

Obviously, this list is heterogeneous and requires further elaboration
and refinement. Some subjects are intimately related to other system
transforms and thus depend upon the outcome of advances in these
transforms. Others are supraordinate in nature and are, therefore, perhaps
best deferred until a specific system has been designed. A general
approach to these subjects may be possible; since such an approach
would have the greatest impact on the processing configuration, these
subjects were included as tentative functions of the P transform.

4.2.5 The D^-1 Transform - No substantive elements of this transform
have been defined.

4.3 INFORMATION THEORETICAL METHODS OF DOCUMENT CATEGORIZATION

4.3.1 General - This section presents some applications of
information theory to the problem of document classification or
categorization. Criteria for a good categorizer are presented, and various
information theoretical measures that measure the goodness of categorizers
are examined.

The problem of document categorization is the problem of selecting
from a set of possible categories those categories to which a document
may belong. This selection would have to be based upon certain clues
or indications found in the document itself. Thus, as Maron [1?] has
stated, the problem of categorization can be divided into two parts:
the selection of certain relevant aspects of a document as clues toward
classification; and the use of these clues to predict the proper category
to which the document belongs. Once the method of classification has
been defined, then the procedures could be automated.

Many authors [1, 2, 5, 7, 16, 20, 25] have felt that the occurrence
of certain words in a document provided excellent indications of the
category to which that document belonged. Based upon word occurrence
statistics, document categories would be predicted automatically. This
approach is also developed here, but certain information theoretical
techniques are applied that do not appear to have been applied elsewhere.

This approach assumes that a group of human experts will initially
classify a number of documents into a given set of categories. A basic
assumption is that all categories that receive one or more documents will
be retained as permanent categories, which will be the only categories
used in the future. Another assumption is that the number of documents
initially classified by experts is large enough so that the statistics
of this group may be assumed to reflect the statistics of the body of
documents that may later be automatically categorized. In other words,
relative frequencies of categorization obtained from the initial group
will be used as the probabilities of categorization of the larger group.

4.3.2 Criteria for Selecting Predictors - It is expected that the
occurrence of certain words in a document indicates the categorization
of that document. It follows that one of the criteria for selecting a
particular word to predict categories is that its occurrence in
documents be strongly correlated with the appearance of those documents in
a particular category, for those documents that were initially classified.
In other words, a word that appears in every document of a particular
category and appears in no document of any other category seems to be
an ideal predictor of that category. In practice there may be few of
these ideal predictors; then it is necessary to look for words for
which occurrence in a document means a particular category for that
document is much more likely than any other category.

This criterion would be sufficient for choosing indicator words if
the distribution of documents in the categories were uniform. In
practice, this condition would generally not be the case; some categories
would have many more documents than others. Then a word that would seem
to be an excellent indicator might be found to supply no more information
than the total distribution of documents supplied. Thus the occurrence of
the good indicator word in documents must not only be strongly correlated
with the classification of these documents in one particular category,
but the distribution of documents containing this word must also markedly
differ from the distribution of all the documents.

4.3.3 Information Theoretical Treatment of Predictor Criteria

4.3.3.1 Statement of the Problem - The problem can now be
expressed mathematically: Given $N$ documents classified into $k$
categories $C_j$, where $j = 1, \ldots, k$.* The vocabulary of the $N$
documents contains $m$ words, $W_i$, $i = 1, \ldots, m$. Word $W_i$ occurs
in $n_i$ documents, and $n_{ij}$ of these documents fall into category
$C_j$.

* The classification of a document into two or more categories is counted
as the classification into one category each of two or more documents.

Let:

$p_j = P(C_j)$

$p_{ij} = P(C_j \mid W_i)$

Then $p_j$ is the probability that a document falls into category $C_j$,
and:

$p_{ij} = n_{ij} / n_i$

is the probability that a document containing word $W_i$ falls into
category $C_j$.

The following relationships hold by definition:

$\sum_j n_{ij} = n_i$    (4-1)

$\sum_j p_j = 1$    (4-2)

$\sum_j p_{ij} = 1$    (4-3)

It has been assumed that there exists at least one document in
each category; i.e., the smallest possible $p_j = 1/N$. If there were no
documents in a category $C_j$, then $p_j$ would be zero; consequently, all
the $p_{ij}$ would be zero. Such a category would be of no use and would be
discarded. Having at least one document in each category also implies that
$k \le N$, and that the largest possible $p_j = 1 - \frac{k - 1}{N}$; there
are $k - 1$ categories that would have to have the minimum $p_j$. Therefore:

$\frac{1}{N} \le p_j \le 1 - \frac{k - 1}{N}$    (4-4)

and:

$0 \le p_{ij} \le 1$
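The estimation of these probabilities from the initially classified documents can be sketched as follows. The counts are invented for illustration, and the name `category_counts` (documents per category) is an assumption, since the surviving text gives no symbol for that quantity.

```python
from fractions import Fraction

# Hypothetical counts for k = 3 categories and m = 2 words.
category_counts = [6, 3, 1]   # initially classified documents per category
word_counts = [               # word_counts[i][j] plays the role of n_ij
    [4, 0, 1],                # documents containing W_1, by category
    [1, 3, 0],                # documents containing W_2, by category
]

N = sum(category_counts)
k = len(category_counts)

# A priori probabilities p_j taken as relative frequencies.
p = [Fraction(count, N) for count in category_counts]

# Conditional probabilities p_ij = n_ij / n_i, where n_i = sum over j of n_ij.
p_given_word = [[Fraction(n_ij, sum(row)) for n_ij in row] for row in word_counts]

# Bounds that follow from having at least one document in every category.
lower, upper = Fraction(1, N), 1 - Fraction(k - 1, N)
within_bounds = all(lower <= p_j <= upper for p_j in p)

print(p, p_given_word, within_bounds)
```

Exact fractions are used here only so that the sums to unity can be checked without rounding error; floating point values would serve equally well in practice.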


4.3.3.2 Definitions of Measures of Goodness - The non-correlation
of word occurrence and category, or the uncertainty of category given the
occurrence of a word, can be expressed by Shannon's formula for entropy:

$H_i = H(C_j \mid W_i) = -\sum_j p_{ij} \log p_{ij}$    (4-5)

Thus a good indicator word would have a low $H_i$. But is this word
supplying more information than the total document distribution? Maron
suggests a measure:

$M_1 = H - H_i$    (4-6)

where:

$H = H(C_j) = -\sum_j p_j \log p_j$    (4-7)

$H$ is simply the uncertainty of categorization when no word occurrences
are known; that is, $H$ is the entropy of the a priori distribution of
all of the documents.


This measure, however, does not seem adequate. Difficulty
arises when the a priori $p_j$ are unequal and have the same numerical
values as the $p_{ij}$ of different categories; in this case, $H = H_i$ and
$M_1 = 0$, which indicates a bad predictor; but $W_i$ may actually be a good
one in terms of the given criteria. The example in Figure 1 illustrates
this difficulty. Clearly $H = H_i$ and $M_1 = 0$ in Figure 1, but $W_i$ is a
good predictor and supplies a great deal of information.
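The difficulty can be reproduced numerically. The sketch below assumes two categories with a priori probabilities .7 and .3 and a word whose conditional distribution has the same two values in the opposite order; these values are chosen for illustration and are not read from Figure 1. Logarithms are taken to base 2, which is also an assumption.

```python
import math


def entropy(distribution):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(q * math.log2(q) for q in distribution if q > 0)


# Assumed values: unequal a priori probabilities, and a word whose
# conditional distribution has the same values in swapped order.
prior = [0.7, 0.3]        # p_j
given_word = [0.3, 0.7]   # p_ij for one word W_i

H = entropy(prior)
H_i = entropy(given_word)
M_1 = H - H_i

# Most probable category before and after the word is observed.
before = prior.index(max(prior))
after = given_word.index(max(given_word))

print(f"H = {H:.4f}, H_i = {H_i:.4f}, M_1 = {M_1:.4f}")
print(f"most probable category: {before} without the word, {after} with it")
```

The measure vanishes even though the occurrence of the word reverses the most probable category, which is the sense in which the word supplies information that the measure does not register.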


More effective measures of the adequacy of an indicator word
can be based on a relative entropy function of the type found in
Watanabe [32]. This function is similar to the previous entropy
functions, but it accounts for the a priori probabilities directly. The
relative entropy, $S_i$, is defined by:

$S_i = -\sum_j p_{ij} \log \frac{p_{ij}}{A p_j}$    (4-8)

where $A$ is a positive constant chosen to keep $S_i$ non-negative. $A$
should be chosen such that $A = 1/p_c$, where $p_c \le p_j$ for all $j$, so
that $S_{i,\min} = 0$. This condition means that $k \le A \le N$, since
$1/N \le p_c \le 1/k$.

[FIGURE 1. Probability Distributions for a Class of Documents. The figure
compares the a priori distribution with the distribution of documents
containing word $W_i$; the only legible value is $p_1 = .7$.]


Before  these  measures  are  defined  and  examined,  one  more  entropy 
function  must  be  defined: 

$H_A = -\sum_j p_j \log (p_j / A) = H + \log A$    (4-9)


Three possible measures will now be defined, in addition to the measure
that Maron has suggested.

$M_1 = H - H_i$    (Maron's measure)
$M_2 = H - S_i$
$M_3 = H_A - S_i$
$M_4 = \log A - S_i$    (4-10)

Now:

$M_2 = H - H_i - \sum_j p_{ij} \log p_j - \log A$
$M_3 = H - H_i - \sum_j p_{ij} \log p_j = M_2 + \log A$    (4-11)
$M_4 = -H_i - \sum_j p_{ij} \log p_j = \sum_j p_{ij} \log \frac{p_{ij}}{p_j}$

The new $M_2$ and $M_3$ are similar to $M_1$, except for a cross-term that
relates the $p_j$ and the $p_{ij}$. $M_4$ also has this cross-term. $M_4$ is
simply $M_3$ with the constant term missing.
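The four measures can be computed side by side. The following is a minimal sketch, assuming natural logarithms and taking A as the reciprocal of the smallest a priori probability; the function name `measures` and the sample distributions are illustrative only.

```python
import math

def measures(prior, given_word):
    """Return (M1, M2, M3, M4) for one word, following (4-10).

    prior      -- a priori category probabilities p_j (all > 0)
    given_word -- category probabilities p_ij given word W_i
    """
    A = 1.0 / min(prior)                       # A = 1/p_G, p_G = min p_j
    H = -sum(p * math.log(p) for p in prior)   # (4-7)
    H_i = -sum(q * math.log(q) for q in given_word if q > 0)   # (4-5)
    # Relative entropy S_i of (4-8); terms with p_ij = 0 contribute 0.
    S_i = -sum(q * math.log(q / (A * p))
               for q, p in zip(given_word, prior) if q > 0)
    H_A = H + math.log(A)                      # (4-9)
    return H - H_i, H - S_i, H_A - S_i, math.log(A) - S_i

# Hypothetical distributions (assumption): the word reverses the priors.
m1, m2, m3, m4 = measures([0.7, 0.2, 0.1], [0.1, 0.2, 0.7])
print(m1, m2, m3, m4)
```

With these sample values $M_1$ is zero while $M_4$ is clearly positive, which is the contrast the text draws between Maron's measure and the relative-entropy measures.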


4.3.3.3 Maxima and Minima of the Measures of Goodness - The
behavior of these measures of goodness and of the various entropy
functions is developed in Appendix A, Section 8.


4.3.3.4 Evaluation of the Measures - Measure $M_1$ was shown to
be inadequate, since it may erroneously indicate that a good predictor
is a bad predictor. In addition, $M_1$ can assume negative values. $M_2$ can
also assume negative values, which may make it inconvenient to use. $M_2$
is also inconvenient to calculate, since it requires the calculation of
two sums, $\sum_j p_j \log p_j$ and $\sum_j p_{ij} \log (p_{ij}/p_j)$, and since
the last summation also includes a division operation. $M_3$ requires the
calculation of these same sums, although it is slightly more convenient
to use since $M_3$ is always positive. $M_1$, $M_2$, and $M_3$ have fairly
complex expressions for maxima and minima; $M_1$ and $M_2$ become negative,
and $M_3$ never reaches zero.

It seems clear then that $M_4$ is the best measure of the group:
it is never negative, has a simple expression for a maximum, has a zero
minimum, and is easier to calculate than the others.


4.3.3.5 Mathematical Expression of Predictor Criteria - The
correlation of the occurrence of an indicator word in a document and the
classification of that document in a particular category would be measured
by

$H_i = -\sum_j p_{ij} \log p_{ij}$    $(0 \le H_i \le \log k)$    (4-12)

A low $H_i$ indicates a good predictor; a high $H_i$, a bad predictor.

A measure that also accounts for the a priori distribution of
documents and indicates how much more information the predictor supplies
than this distribution is

$M_4 = \sum_j p_{ij} \log \frac{p_{ij}}{p_j}$    $(0 \le M_4 \le -\log p_G)$    (4-13)

$(1/N \le p_G \le 1/k)$

A high $M_4$ indicates a good predictor; a low $M_4$, a bad one. Both of
these measures must be taken into account when choosing indicator words.


4.3.4 Predictors - On the basis of these mathematical criteria, it
is now possible to select clues or predictors. A word that has a high
value for $M_4$ and a low value for $H_i$ will be selected. The cutoff point
for these functions for good predictors must be determined experimentally.
It is difficult to say how high a value for $M_4$ or how low a value for
$H_i$ is actually needed for a good predictor without empirical verification.

Not only can single words be used as predictors, but word pairs, word
triplets, and higher word combinations can also be used with an expected
improvement in prediction. The mathematics for these cases is essentially
the same; the only difference is that the occurrence of a word pair
or word triplet is considered instead of the single word $W_i$.

These  word  pairs  and  word  triplets  can  be  ranked  together  with  single 
words  on  the  same  scale,  and  their  effectiveness  as  predictors  can  then 
be  compared. 
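The selection and ranking step might be sketched as follows. The cutoff values, the sample words, and their distributions are hypothetical; the text states that real cutoffs must be found experimentally. Word pairs are represented here as tuples so that they rank on the same scale as single words.

```python
import math

def h_i(given_word):
    """Conditional entropy H_i of (4-12)."""
    return -sum(q * math.log(q) for q in given_word if q > 0)

def m_4(prior, given_word):
    """Measure M4 of (4-13)."""
    return sum(q * math.log(q / p)
               for q, p in zip(given_word, prior) if q > 0)

def select_predictors(prior, word_dists, min_m4, max_hi):
    """Keep entries with M4 >= min_m4 and H_i <= max_hi, highest M4 first.

    word_dists maps a word (or a tuple of words) to its distribution p_ij.
    Returns a list of (word, M4, H_i) triples.
    """
    scored = [(w, m_4(prior, d), h_i(d)) for w, d in word_dists.items()]
    kept = [s for s in scored if s[1] >= min_m4 and s[2] <= max_hi]
    return sorted(kept, key=lambda s: s[1], reverse=True)

# Hypothetical statistics for three categories (assumption).
prior = [0.5, 0.3, 0.2]
word_dists = {
    "radar": [0.05, 0.05, 0.90],
    "the": [0.5, 0.3, 0.2],
    ("radar", "antenna"): [0.01, 0.01, 0.98],
}
for word, m4, hi in select_predictors(prior, word_dists, 0.5, 0.6):
    print(word, round(m4, 3), round(hi, 3))
```

A word distributed exactly like the prior, such as "the" in this sample, has $M_4 = 0$ and is rejected regardless of the cutoff chosen.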

4.3.5 Application of Clues to Predicting Categories - Once the
significant predictors have been determined, it is possible to obtain the
probability that a document appears in a category on the basis of those
predictors. This probability is:

. ) 

Maron gives an approximation to this probability. In general, this
approximation would require a great deal of calculation. One way of
approximating the probability would be to take the weighted average of
the category probabilities using each of the most significant indicator
words. Other functions of these words might also approximate the
probability. Thus, in general, the predicted category would be some
function of the category probabilities for each of the words. Methods for
determining suitable functions of this kind should be investigated.
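One reading of the weighted-average approach is sketched below. The report does not specify the weights; using each word's $M_4$ value as its weight is an assumption made here for illustration, and `predict_category` is a hypothetical name.

```python
def predict_category(word_dists, weights):
    """Weighted average of per-word category distributions.

    word_dists -- list of distributions p_ij, one per indicator word
                  found in the document
    weights    -- one non-negative weight per word (assumed to be M4)
    Returns (index of the most probable category, averaged distribution).
    """
    total = sum(weights)
    k = len(word_dists[0])
    avg = [sum(w * dist[j] for w, dist in zip(weights, word_dists)) / total
           for j in range(k)]
    best = max(range(k), key=lambda j: avg[j])
    return best, avg

# Two indicator words, two categories; all numbers are hypothetical.
best, avg = predict_category([[0.1, 0.9], [0.6, 0.4]], [3.0, 1.0])
print(best, avg)
```

Any other function of the per-word distributions could be substituted for the average, which is the open question the paragraph above raises.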
4.3.6 Modification of Categories - Implied in this discussion are
criteria for modifying and combining categories to get better
classification. What is needed is a set of categories that would be
strongly correlated with word occurrence and that would yield
approximately equal a priori category probabilities. In this way, there
would be words with high $M_4$ and low $H_i$. In fact, these two measures
would then be almost the same; for if $p_j = 1/k$ for all $j$, then:

$M_4 = \sum_j p_{ij} \log p_{ij} + \log k = \log k - H_i$    (4-15)

Thus in equalizing the categories, if for some $W_i$, $M_4$ is high and there
exists at least one such $W_i$ for each category, then the classification
would be a good one.


4.3.7 Summary - The criteria for selecting appropriate words in a
document as predictors of the document category have been presented.
Representations of these criteria have been demonstrated in terms of
information theoretical measures. These measures have been analyzed
and evaluated; one set, designated as $H_i$ and $M_4$, was finally chosen as
the most effective. An indication of how the category might be selected
was then developed; similarly, an indication of the basis on which the
existing categories might be modified to improve classification was
suggested.

Although this discussion has been presented in terms of selecting
one of k major categories, once a major category has been determined,
the same process can be used to determine subcategories; the
mathematics are identical, and subcategory statistics would be used.

4.4 CORRECTIVE PROCEDURES FOR INDEXING SYSTEMS

4.4.1 General - This section investigates the methods and feasibility
of applying corrective procedures to indexing systems. A fundamental
aspect of these concepts is their ultimate adaptability to automated
procedures. The first part of this discussion presents the basic ideas
of this concept; the second part develops the concept formally.

4.4.2 The Taxonomy of Indexing Systems - Information retrieval
systems consist of a library of documents and a set of indexing rules and
procedures for linking descriptors to documents. The documents in this
context refer to the smallest ensemble of information subject to retrieval;
these documents are considered as being indivisible. The indexing rules
and procedures theoretically select descriptors that bear some relation
to the descriptors used by people who will interrogate the system.

The system may accept new documents in its library; the documents
are then classified according to the rules and procedures of the
indexing scheme of the system. The system is not necessarily committed to
the use of old descriptors. The indexing rules allow for the supply of
new descriptors with the acceptance of the new documents by the library.

The user specifies his requests for information by writing a sequence
of acceptable descriptors in the form of a Boolean function; that is,
the descriptors joined by OR and AND. The user's disposition of the
descriptors implies the existence of an ideal taxonomic system. The
taxonomy imposed by the indexing rules and procedures constitutes an
external taxonomy or a priori taxonomy.

A corrective procedure will cause the external taxonomy to evolve
into the ideal taxonomy on the basis of information concerning the
adequacy of the sets of documents retrieved. This information is
supplied by the user.

The central problem is: On what factors does the functioning of a
corrective procedure depend? The answer to this problem depends upon the
elucidation of the relation between the ideal and the external taxonomy.
More specifically, the hypothesis depends upon the concept of invariance.
Invariance pertains to the a priori postulated constancy between
descriptors in the two taxonomies.

This  discussion,  then,  will  advance  the  hypothesis  that: 

(a)  The  concept  of  relatedness  of  descriptors  can  be  made  numerically 
precise. 

(b) The concept of relatedness can serve as a building block for
more complex relationships between descriptors.

(c)  Some  such  relationships  are  postulated  as  being  constant; 
i.e.,  these  relationships  remain  invariant  in  both  the 
external  and  the  ideal  taxonomies. 

(d)  The  existence  of  such  constancies  forms  the  basis  for  select¬ 
ing  rules  of  reassigning  descriptors  among  documents. 

The remainder of this section will attempt to validate this hypothesis
and describe the resultant consequences.

4.4.3 Formalization of the Hypothesis - Let $d_1, d_2, \ldots, d_n$ and
$D_1, D_2, \ldots, D_m$ be descriptors and documents, respectively. To every
descriptor there corresponds a class of documents spanned by this
descriptor.

In set-theoretic notation this concept becomes:

$[(D): d(D) = d_i(D)]$    (4-16)

which may be read as "the set of all documents such that descriptor $d_i$
applies to the set." To avoid cumbersome notation, the abbreviation
$[D(d_i)]$ will be used to represent the set. The number of documents
contained in such a set will be denoted by $M$. Then $M[D(d_i)]$ stands for
the number of documents contained in the set spanned by the descriptor
$d_i$.


In general, to every Boolean function of descriptors there corresponds
a set of documents spanned by these descriptors. Therefore, "the set of
all documents that are indexed by $B(d)$" becomes:

$[D(B(d))]$    (4-17)

For example,

$[D(d_i \wedge (d_j \vee d_k))]$    (4-18)

is a set of all documents that have as their indices the descriptors
$d_i$ and $d_j$ or $d_k$ or both, among others. It is clear that the
following relation holds:

$[D(B(d))] = B[(D(d))]$    (4-19)

This expression signifies that the set of all documents spanned by a
Boolean function of descriptors is equivalent to the Boolean function
of sets spanned by these descriptors. By analogy, the expression
$[d(B(D))]$ represents a set of predicates contained in the set of
documents described by the Boolean function $B(D)$.
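Relation (4-19) says that a Boolean function of descriptors can be evaluated by applying the same function to the spanned document sets, with AND as intersection and OR as union. A minimal sketch, assuming a small hypothetical index and a tuple encoding of Boolean expressions chosen for this illustration:

```python
# Hypothetical index (assumption): descriptor -> set of document ids, [D(d)].
index = {
    "d1": {1, 2, 3, 4},
    "d2": {2, 5},
    "d3": {3, 6},
}

def span(expr):
    """Evaluate a Boolean descriptor expression to its document set.

    expr is a descriptor name, or a tuple (op, left, right) with op equal
    to "and" or "or". AND maps to set intersection, OR to set union.
    """
    if isinstance(expr, str):
        return index[expr]
    op, left, right = expr
    if op == "and":
        return span(left) & span(right)
    if op == "or":
        return span(left) | span(right)
    raise ValueError(f"unknown operator: {op}")

# The form of example (4-18): d1 AND (d2 OR d3).
result = span(("and", "d1", ("or", "d2", "d3")))
print(sorted(result), len(result))   # the set and its count M
```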

The relatedness of descriptors or their Boolean functions is defined
as the number of documents contained in the intersection of classes
spanned by these descriptors or their Boolean functions divided by the
number of documents spanned by the union. Formally, this definition
becomes:

$R_d[B_i(d), B_j(d)] = \frac{M[B_i(D(d)) \wedge B_j(D(d))]}{M[B_i(D(d)) \vee B_j(D(d))]}$    (Definition 1)    (4-20)

A similar concept of the relatedness of documents or their Boolean
functions is defined analogously:

$R_D[B_i(D), B_j(D)] = \frac{M[B_i(d(D)) \wedge B_j(d(D))]}{M[B_i(d(D)) \vee B_j(d(D))]}$    (Definition 2)    (4-21)

It is important to note that throughout this discussion the concepts for
descriptors can be analogously applied to documents. The subsequent
development, however, will be limited to the relatedness of descriptors.
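Definition 1 is the ratio of the size of an intersection to the size of a union, so it can be computed directly on document sets. In the sketch below, treating two empty spans as unrelated is an assumption; the report does not address that case.

```python
def relatedness(docs_a, docs_b):
    """Definition 1: documents in the intersection divided by documents
    in the union of two spanned document sets."""
    union = docs_a | docs_b
    if not union:
        return 0.0   # assumption: two empty spans count as unrelated
    return len(docs_a & docs_b) / len(union)

# Two of four distinct documents are shared.
print(relatedness({1, 2, 3}, {2, 3, 4}))   # 0.5
```

The value ranges from 0 for disjoint spans to 1 for identical spans.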

Since the external taxonomy by hypothesis does not precisely
correspond to the ideal taxonomy, the distinct symbol, $\delta$, is
introduced to represent the descriptors of the user. These descriptors
are only different insofar as they index classes of documents that are
not identical with the classes of documents indexed by the descriptors of
the external taxonomy. Thus for any descriptor or index $i$, $[d_i(D)]$ and
$[\delta_i(D)]$ are not necessarily identical, even though the descriptors
themselves may be the same. The objective of corrective procedures is to
adjust the application of descriptors to documents so that the two sets
become identical. The corrective procedures may have fulfilled their task
if the objective is approximated to the extent that any divergence has a
negligible impact upon the user.

4.4.4 The Basis of Corrective Procedures - Assume that all retrieval
requests consist of single descriptors. The user formulates his request
in terms of a descriptor $\delta_i$ related to the ideal taxonomy. The system
retrieves all documents spanned by this descriptor, except that this
descriptor is $d_i$ in the external taxonomy. The user then decides whether
the retrieved collection of documents is satisfactory. The collection
may not satisfactorily fulfill the user's requirements for three reasons:

(a) Too many documents were collected.

(b) Too few documents were collected.

(c) Some documents are superfluous and some are missing.

The corrective procedures should select documents more in consonance
with the user's needs and then effect permanent changes in the application
of descriptors to documents.

If  the  system  retrieves  too  many  documents,  the  system  may  select 
a  set  of  descriptors  that  are  most  related  to  the  user's  descriptor  and 
then  remove  from  the  retrieved  set  those  documents  spanned  by  the  related 
descriptors.  This  method  conceals  a  difficulty.  Although  a  measure  for 
relatedness  of  two  descriptors  has  been  defined,  no  technique  has  yet 
been  specified  to  select  clusters  of  most  related  descriptors. 

If the system retrieves too few documents, a set of descriptors most
closely related to the given descriptor is assembled; the set may be
limited to a single descriptor. A Boolean function of these descriptors
is then constructed, and documents spanned by the Boolean function are
retrieved. The factors that determine the nature of the particular
Boolean function of descriptors must still be defined.
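The "too few documents" case might be sketched as below. The report leaves the form of the Boolean function open; this example assumes a plain OR of the requested descriptor with its most related neighbors, and the index contents are hypothetical.

```python
def jaccard(a, b):
    """Relatedness of two document sets (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def expand_request(index, descriptor, n_related=1):
    """Documents spanned by the descriptor OR its n most related descriptors.

    index maps each descriptor to the set of documents it spans.
    Combining with OR is an assumption, not the report's rule.
    """
    base = index[descriptor]
    others = sorted((d for d in index if d != descriptor),
                    key=lambda d: jaccard(base, index[d]), reverse=True)
    docs = set(base)
    for d in others[:n_related]:
        docs |= index[d]
    return docs

# Hypothetical external index (assumption).
index = {"radar": {1, 2}, "antenna": {2, 3}, "soup": {9}}
print(sorted(expand_request(index, "radar")))
```

The "too many documents" case would use the same ranking but subtract, rather than add, the sets of the related descriptors.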

If some documents are superfluous and some are missing, the problem
may be handled as a combination of the specific problems of too many or
too few documents. More realistically, however, some problems of this
type are generic, and specific solutions must be developed.

After the originally inadequate set of documents is deleted to the
satisfaction of the user, the corrective procedures must effect permanent
changes in the extension of some descriptors so that the denotation of
the external and ideal descriptors approach equivalence. The problem is
to render the sets $[\delta_i(D)]$ and $[d_i(D)]$ extensionally as similar as
possible. Several corrective procedures may be used:

(a) To affix the user's descriptor to all the documents and only
those documents in the acceptable retrieved set.

(b) To delete or add some descriptors selectively from the set of
documents spanned, after the process of deletion or augmentation.

(c) To delete or add some descriptors selectively to the documents
that were deleted or complemented from the originally inadequate
retrieved set.

(d)  To  effect  other  descriptor  changes  on  the  documents  not  affected 
by  the  processes  of  complementation  or  deletion. 

The first procedure by itself will not produce the desired
transformation until all descriptors have been used in retrieval processes
at least once. This prospect is uninviting for any document collection
with a large number of descriptors. If such a procedure were feasible,
there would be no reason not to index the entire collection in the ideal
taxonomy in the first place. In addition, the procedure of complementing
the original set of documents need not necessarily lead to the formation
of a taxonomy whose extension is identical to the ideal. Rather,
the process may only be an approximation; that is, a set obtained after
a series of complementations may only approximate the ideal taxonomy.

A closer look at the remaining three procedures and their inherent
problems is necessary. Consider a class of documents spanned by
descriptor $d_i$. Suppose that the user requests all documents under the
descriptor $\delta_i$, a descriptor corresponding to $d_i$. The class $[D(d_i)]$
is retrieved; it does not fulfill the user's requirements. The
complementation procedure results in formation of a new class $[D'(d_i)]$.
The corrective procedure should then implement changes pertaining to the
distribution of the remaining descriptors among documents. How should
these changes be made? Or, to rephrase this question, on what should the
inferential processes be based in order to ensure that the ideal taxonomy
is approximated?

Assume that there is no relation between the external and the ideal
taxonomies. In this case the first stage of the corrective procedure,
that is, the complementation of the selected set, must proceed at random.
If the taxonomy imposed upon the collection of documents is not correlated
with the taxonomy implied by the user, then the relatedness of descriptors
to one another will be of no help either in reassigning descriptors or in
complementing the original sets.

The possibility of developing corrective procedures depends, therefore,
upon some a priori relation between the two taxonomic systems. If such a
relationship exists, then it must be expressible in terms of the concept
of relatedness. The relatedness of descriptors in one system must
resemble the relatedness in the other. The concept of a relatedness
between two taxonomic systems isolates the particular invariance that
characterizes the sets of documents designated by certain descriptors.


Formally, an invariance exists if $R(\delta_i, \delta_j)$ is true whenever
$R(d_i, d_j)$ is true, where $R$ is a relationship between descriptors.
There need not be some universal type of invariance present whenever
there is a resemblance between two taxonomic systems. On the contrary,
depending upon the nature of the data to be retrieved, the invariance
between the ideal and the external taxonomy may differ.

Some examples may clarify the concept of invariance. First, if a
set of documents spanned by a descriptor in one system contains another
set of documents spanned by another descriptor and if this condition
implies the same condition for the corresponding descriptors in the
other system, then the invariance might be called nested invariance.
Formally:

$[D(d_j)] \supset [D(d_i)] \rightarrow [D(\delta_j)] \supset [D(\delta_i)]$    (4-22)

where $\rightarrow$ indicates "implies," and $\supset$ indicates set inclusion.
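Nested invariance can be tested mechanically when both taxonomies are available as descriptor-to-document-set tables. This is a minimal sketch with hypothetical data; it treats inclusion as non-strict (a set includes itself), which is an assumption about the intended reading of (4-22).

```python
from itertools import permutations

def nested_invariance_holds(external, ideal):
    """Check (4-22): every inclusion among external spans also holds
    among the corresponding ideal spans.

    external, ideal -- dicts mapping the same descriptor names to sets
    of document ids.
    """
    for a, b in permutations(external, 2):
        if external[a] >= external[b] and not ideal[a] >= ideal[b]:
            return False
    return True

# Hypothetical taxonomies (assumption).
external = {"vehicle": {1, 2, 3}, "truck": {2, 3}}
ideal = {"vehicle": {1, 2, 3, 4}, "truck": {3, 4}}
print(nested_invariance_holds(external, ideal))
```

The document sets themselves may differ between the two systems; only the inclusion relation has to carry over.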


In a second example the most closely related descriptors in one
system are also most closely related in another. To represent this type of
invariance formally, let $(d_i, d_j)^*$ be an ordered pair of descriptors
that are related to each other as follows:

$R_d[(d_i), (d_j)] = \max_k R_d[(d_i), (d_k)]$    (for all $k$)    (4-23)

If then $(d_i, d_j)^* \rightarrow (\delta_i, \delta_j)^*$, the relationship of
being most closely related is preserved.
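The most-closely-related condition of (4-23) can be checked in the same table-driven way. The sketch below uses the intersection-over-union relatedness of Definition 1; breaking ties in favor of the first candidate is an assumption, as are the function names.

```python
def jaccard(a, b):
    """Relatedness of two document sets (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_related(index, d):
    """Descriptor other than d with maximum relatedness to d, per (4-23).

    Ties go to the first candidate in dict order (an assumption).
    """
    candidates = [other for other in index if other != d]
    return max(candidates, key=lambda other: jaccard(index[d], index[other]))

def most_related_preserved(external, ideal):
    """True if each descriptor has the same most related partner in both
    taxonomies. Both dicts must share the same descriptor names."""
    return all(most_related(external, d) == most_related(ideal, d)
               for d in external)

# Hypothetical taxonomies (assumption).
external = {"a": {1, 2}, "b": {2, 3}, "c": {7}}
ideal = {"a": {1, 2, 4}, "b": {2, 4}, "c": {8}}
print(most_related_preserved(external, ideal))
```

Replacing `max` with `min` gives the least-closely-related variant, where ties are far more common because many descriptor pairs share no documents.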


The third example replaces MAX by MIN to obtain an invariance of
being the least closely related descriptor. In spite of the formal
similarity between the most and least closely related conditions, there is
a formidable practical difference. The most closely related condition
preserves an invariance between a descriptor and a descriptor; the least
closely related condition preserves an invariance between a descriptor
and a class of descriptors.

As a fourth example the concept of most closely related descriptors
may be applied to chains of descriptors. In such a relationship one
descriptor leads to another to form an associative chain. There are
many non-equivalent ways of formulating the conditions for the existence
of such a chain. One is to let $\langle d_1, d_2, \ldots, d_n \rangle$ be an
associative chain of $n$th order. Then this chain is defined as:

(a) The set $[d_1, d_2, \ldots, d_n]$ of descriptors comprised in the chain
contains each element except the first and the last only once.

(b) The first element appears twice; it is also the last element.

(c) Each element except the first determines its successor by
selecting the second most related descriptor. The first
descriptor determines its successor by selecting its most
related neighbor.

Then, if every associative chain of $n$th order in one taxonomic system
corresponds to a chain in another, a chain invariance of $n$th order
exists. The elements in one chain correspond to the elements in the
other, but not necessarily in the same order.

There  are  a  number  of  additional  possible  relationships  that  remain 
invariant.  The  problem  is  to  select  those  that  realistically  relate  to 
the  properties  of  data  structures  and  their  associated  indexing  systems. 

If these invariances exist, rules for reassigning the descriptors
may be deduced. The concept of invariance places a strong constraint
upon the type of admissible rules that can be formulated. There is also
a relation between the invariances and the nature of the convergence and
efficiency criteria imposed upon the corrective procedures. The
important question is: Given a specific form of invariance and the
appropriate rules for complementing sets and for reassigning descriptors,
how many queries must elapse before the external taxonomy approximates
the ideal? (Approximation in this sense may mean either the probability
of obtaining a set that is too small or too large by a specified margin.)

A comparison between one type of invariance and another now becomes
possible. Those invariances that result in a quick convergence of the
corrective procedures are desirable. Conversely, it is possible to
investigate the suitability of rules for complementing and reassigning
descriptors by keeping a set of invariant relationships constant. All
these problems can be investigated mathematically.

4.4.5 Summary - There is an inherent problem in accommodating the
descriptors selected for a set of documents by indexing rules to the
descriptors used by the user of a system. This problem is related to
the extensional difference in the denotation of descriptors or words
in an external and an ideal taxonomy. This discussion described methods
for developing corrective procedures, which would be applied automatically,
to relate the external to the ideal taxonomy. The basis for developing
the inferential rules for these procedures is the concept of invariance.

This problem is real, but it is also peripheral. It is more
important to develop an adequate indexing concept first; only then does
the question of efficiency become important. A significant amount of
mathematical formulation remains before adequate corrective procedures
can be implemented.

4.5 MATHEMATICAL MODELS OF FUNCTIONS

4.5.1 General - This section surveys and summarizes some basic
concepts of information storage and retrieval and their related
mathematical models. These models pertain to particular functions and are
thus differentiated from the general system model; in effect, this
discussion, which is based upon and derived from Hayes [19, 26], initiates
the framework for the formal analysis and development of the transform
functions. The elaboration of this framework will be performed during
subsequent quarterly periods.

A general theory of information retrieval should encompass at least
the following aspects of the problem of storage and retrieval:

(a) Representation of file items.

(b) File organization.

(c) System design and synthesis.

These aspects of a system do not exhaust the elements that should be
considered; for example, the measures of relevance presented in the First
Quarterly Report also constitute an integral aspect of system design.

A model may be an elegant representation of a trivial problem or a
simple representation of a difficult problem. There has been no attempt
to evaluate the significance of the following models, since their purpose
is to explore the nature of the problems rather than to solve them
explicitly and efficiently. Subsequent analysis will be directed to an
evaluation of various models in terms of their relation to the system
model and their contribution to the solution of the functional problems
of information storage and retrieval.

4.5.2 Representation of File Items - The raw material of an
information retrieval system consists of documents, requests, and the
words or terms used in requests or in referring to or classifying
documents. A representation of an element of any of these classes will be
called a file item. File items are organized by means of:

(a)  Vocabulary. 

(b)  Syntax. 

(c)  Coding  and  format. 

Some  of  the  factors  of  each  element  of  a  file  item  are  discussed  briefly 
before  a  model  for  item  definition  is  presented. 

4.5.2.1 Vocabulary - There are six general types of vocabularies.
These types represent a spectrum from unorganized or highly flexible to
highly organized or rigidly structured and restrictive. They are listed
in order of flexibility, proceeding from the most flexible to the most
structured.

(a)  Natural  language. 

(b)  Standardized  (keywords). 

(c)  Subject  headings. 

(d)  Semantic  factors. 

(e) Classifications.

(f) Facet analysis.

The first type is written or conversational language; it permits
the use of all the words and phrases in the language, subject only to the
rules of grammar and meaning. The second type is restricted to a prescribed
set of words; it is discussed by Taube [28] and Jonker [14]. In the third
type a restricted set of words is also organized. The fourth type is
summarized by Vickery [30]. A more complete development appears in Kent
[15]. The fifth type is represented by the well known Dewey decimal and
Library of Congress classification systems. The last type is described by
Ranganathan [24].

The  most  common  model  for  describing  semantic  relations  in  vocabu¬ 
laries  is  the  lattice.  The  lattice  model  is  useful  primarily  because  cer¬ 
tain  lattices  can  be  decomposed  into  the  direct  products  of  two  lattices 
so  that  vocabulary  structures  can  be  exhibited.  A  theorem  to  this  effect 
appears  in  Birkhoff  [3]. 

4.5.2.2 Syntax - A discussion of this area for a sophisticated
vocabulary like natural language would be quite discursive and outside the
scope of this project. However, for most existing information retrieval
systems, a document is represented by a simple conjunction of terms.
Correspondingly, a request is represented by a disjunction of conjunctions
of terms. The disjunctions indicate separate file items. In a fixed format
system such as Uniterm, for example, the syntactical role is a simple one;
it is mere presence or absence. However, in some systems the order of
terms in a request plays a syntactic role.
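The presence-or-absence syntax described above can be sketched with sets. In this illustration a document is a set of terms (a conjunction) and a request is a list of term sets (a disjunction of conjunctions); the function name and the sample terms are hypothetical.

```python
def matches(document_terms, request):
    """Return True if the document satisfies the request.

    document_terms -- set of terms assigned to the document
    request        -- list of term sets; the document matches when it
                      contains every term of at least one of them
    """
    return any(conjunction <= document_terms for conjunction in request)

doc = {"radar", "antenna", "noise"}
request = [{"radar", "noise"}, {"sonar"}]   # (radar AND noise) OR sonar
print(matches(doc, request))   # True
```

Term order plays no role in this representation, which corresponds to the fixed format case; systems in which order is syntactic would need sequences rather than sets.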

4.5.2.3 Coding and Format - Coding and format pertain to the
optimal representation of requests and documents. The specific problem
is the relationship between the length of the representation code, its
effectiveness, and the information that can be retrieved in these terms.

The model appropriate for measuring information content in various
representations involves information theory, either in a semantic or a
classical sense. The classical theory has been well developed, but there
has been hardly any development of an information theory based upon
semantic concepts.

4.5.2.4 Model for Item Definition - The model in this discussion
is geometric; it is not the only possible model. Each document, request,
or term to be represented is considered as a physical body that occupies
volume and has a mass distribution in a multidimensional space. The volume
can be interpreted as the volume of knowledge encompassed by the document;
then the mass distribution represents the contribution of a document to
each point of knowledge. In an actual retrieval system the body will
consist of a discrete set of points in this space. It is assumed that there
is a measure of distance and angle in this space so that distances between
points and the centers of gravity of sets of points can be computed.

A set of coordinate points is selected and the location of any
other point is defined by Barycentric coordinates. In this type of
coordinate system, any point is represented as the center of gravity of a
distribution of mass at each of the coordinate points. Thus, a point can be
located geometrically and assigned a mass.

This general model can be used in several ways to represent file
items; one such use is illustrated in Figure 2. A set of key words is
chosen as the basic set of coordinate points in the space. Specifically,
$p_1$ to $p_{12}$ are the points, each of which represents a key word.
$P_1$, $P_2$, and $P_3$ can be interpreted as documents. The key words
assigned to $P_1$ are $p_1$ through $p_4$; to $P_2$, $p_5$ through $p_8$;
to $P_3$, $p_9$ through $p_{12}$. The documents are located at the center
of gravity of their assigned key words. Thus the Barycentric coordinates
correspond to the assignment of relative importance of each key word to
the document.
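The placement of a document at the center of gravity of its key words can be sketched in a few lines. This is only an illustration under stated assumptions: the space is taken to be two-dimensional, and the function name `center_of_gravity`, the coordinates, and the weights are hypothetical values that do not come from the report.

```python
def center_of_gravity(points, masses):
    """Return the center of gravity of `points` weighted by `masses`.

    The normalized masses are the barycentric coordinates of the result
    with respect to the given coordinate points.
    """
    total = sum(masses)
    if total <= 0:
        raise ValueError("total mass must be positive")
    dims = len(points[0])
    return tuple(
        sum(m * p[d] for p, m in zip(points, masses)) / total
        for d in range(dims)
    )


# Hypothetical key-word points and their relative importance to one
# document (illustrative values only, not taken from Figure 2).
keywords = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
weights = [1.0, 1.0, 1.0, 1.0]
document = center_of_gravity(keywords, weights)
```

With equal weights the document sits at the centroid of its key words; raising the weight of one key word pulls the document toward that point.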

$P_1$, $P_2$, and $P_3$ in Figure 2 can also be interpreted as portions
of documents. Then the representation of a file item by a set of points
is comparable to the representation of a document $P$ by the disjunction of
conjunctions of terms. Each conjunction ($P_1$, $P_2$, or $P_3$) represents
the center of interest of a major section of the document. If the document
treats a relatively restricted topic, as in the case above, one conjunction
may be adequate to describe its contents. If it is concerned with several
unrelated topics, then several will be required. Each such conjunction
corresponds to a single point in the space that was defined. These points
are $P_1$, $P_2$, and $P_3$. The information content of any of these points
is represented by the set of associated key words. Just as before, each key
word defines a point with a given mass, so that the point of interest is
the center of gravity of the mass distribution at the key word points.
Hence, Barycentric coordinates correspond to the assignment of the relative
importance of each key word to a conjunction that represents the
information in a portion of a document. In this sense, $P$ is equal to the
disjunction $P_1 \lor P_2 \lor P_3$, where $P_1$, $P_2$, and $P_3$ are
conjunctions of their associated key words.


4.5.2.5 Relevance - In order to organize and classify terms and
documents and to answer requests effectively it is essential to have some
measure of the degree of association or relevance of terms or documents.
Several such measures were discussed in the First Quarterly Report.
Several others could be mentioned: root mean square, Tchebychev sum,
minimax, nearness in a Boolean lattice, and the chi-square formula. Most
of these measures are either special cases of Barycentric coordinate
weightings or of means of order p. A mean of order p for a set of elements
is defined as:
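For reference, the standard form of a mean of order p (the power mean) is given here. The notation $x_1, \dots, x_n$ for the elements and $M_p$ for the mean is assumed for illustration and is not taken from the report.

```latex
M_p(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{\,p} \right)^{1/p}
```

In this form, $p = 2$ gives the root mean square, and for non-negative elements the limit $p \to \infty$ gives the largest element, which is the basis of minimax-type measures.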

Some of these measures can be rejected out-of-hand as counter-intuitive;
others would have to be evaluated experimentally.

4.5.3 File Organization - The purpose of file organization is to
collect items that are logically related because they are likely to be
wanted together whether formally requested or not. A secondary purpose
is to improve access time to items that are requested or retrieved
frequently. Accordingly, there are four facets of file organization to
consider:

(a)  Logical  organization. 

(b)  Activity  organization. 

(c)  Physical  organization. 

(d)  Reorganization. 

Each  facet  is  analyzed  in  turn  in  the  following  discussion. 

4.5.3.1 Logical Organization - The process of coordinate indexing
assigns terms to documents. A matrix can be formed with the columns
as terms and the rows as documents. An element $a_{ij}$ of the matrix is
one or zero depending upon whether the $j$th term is assigned to the $i$th
document. This matrix is the document-term matrix. The elements of the
document-term matrix can be generalized from a single YES or NO association
to weights that represent the relative importance of the association
between a term and a document. The document-term matrix generally assigns
many terms to a particular document. Consequently, the retrieval of
documents requires the specification of the particular class of pertinent
information as a logical conjunction of terms. Boolean algebra or lattice
theory is required to specify a particular class of documents.
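A small sketch may make the document-term matrix and its use in retrieval concrete. It assumes a request expressed as a disjunction of conjunctions of terms, as described earlier in this section; the vocabulary, the documents, and the function names `build_matrix` and `retrieve` are hypothetical.

```python
def build_matrix(doc_terms, vocabulary):
    """Rows are documents, columns are terms; an entry is 1 if the term
    is assigned to the document and 0 otherwise."""
    return [[1 if t in terms else 0 for t in vocabulary] for terms in doc_terms]


def retrieve(matrix, vocabulary, request):
    """Return the row indices of documents that satisfy the request.

    `request` is a list of conjunctions (each a list of terms); a document
    is retrieved if it satisfies at least one conjunction in full.
    """
    column = {t: j for j, t in enumerate(vocabulary)}
    hits = []
    for i, row in enumerate(matrix):
        if any(all(row[column[t]] for t in conj) for conj in request):
            hits.append(i)
    return hits


# Hypothetical vocabulary and term assignments.
vocabulary = ["computer", "file", "index", "retrieval"]
doc_terms = [
    {"computer", "file"},
    {"index", "retrieval"},
    {"file", "index", "retrieval"},
]
matrix = build_matrix(doc_terms, vocabulary)
# Request: (file AND retrieval) OR (computer)
hits = retrieve(matrix, vocabulary, [["file", "retrieval"], ["computer"]])
```

Replacing the 0/1 entries with weights would give the generalized matrix mentioned above; the Boolean test would then need a threshold or a relevance measure.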

Although the document-term relationship may be used as a tool
for logical organization, the relationships among terms and among documents
implicit in the assignment of terms to documents are not fully revealed
by the Boolean algebra or lattice structures. For example, the fact that
two documents have similar assignments of terms is not apparent from their
common assignment to the classes of documents defined by each of the terms.
However, this degree of association can be displayed by forming a term-term
or document-document matrix. The elements of these matrices would be
values of relevance obtained from the document-term matrix by using some
previously defined measure of relevance to compare rows or columns.
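The construction of the two association matrices can be sketched as follows. The report leaves the measure of relevance open; the cosine of the angle between rows (or columns) is used here purely as an assumed stand-in, and the matrix values and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (assumed relevance measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def association_matrix(vectors):
    """Compare every vector with every other vector."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Hypothetical document-term matrix: 3 documents (rows) by 4 terms (columns).
doc_term = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
doc_doc = association_matrix(doc_term)               # compares rows
term_term = association_matrix(transpose(doc_term))  # compares columns
```

Documents 0 and 1 share no terms, so their association is zero, while terms 2 and 3 are assigned to exactly the same documents and so are fully associated.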

The objective then is to recover information about the possible
groupings of documents or terms from these association matrices. The
groups found can be used as classes for defining a generic relationship
among terms or as a classification for grouping documents. Several
mathematical methods can be used to extract significant factors from an
association matrix. They include at least the following: eigenvalue
analysis, factor analysis, powers of the association matrix, and the
theory of clumps developed by the Cambridge Language Research Unit in
England. These techniques are developed in [1, 4, 8, 12, 21, 26, 29].

Any of these methods produces factors or abstract concepts that
are described by relative weightings of the terms or documents from the
original set. A relative weighting of the original set of points can be
represented by a single point located at the center of gravity of these
weights. This point is identical in character to any other point in the
space and to any point that might have been chosen to represent a term or
a file item. Hence, any set of points related by some degree of
association can be grouped into a category labelled with the center of
gravity of the set. The abstract terms can themselves be grouped by the
same methods into a higher order concept. This technique provides a means
for organizing the set of points of the space.

4.5.3.2 Activity Organization - Files can also be organized on
the basis of activity; that is, by grouping items according to the
likelihood that they will be wanted together. This type of organization can
be superimposed upon a logical organization of a file.

The aim of activity organization is to produce a hierarchical
arrangement such as nested boxes or levels of grouping. Such an
arrangement is illustrated in Figure 3. Each box represents a grouping at
some level of abstraction, the level being described by the relative size
of the box. The smallest boxes or lowest level contain individual raw file
items. If the cover of any box is removed, the interior of the box contains
a nest of smaller boxes of the same general character. For example, if
the cover of a box labelled 1, 2, or 3 in Figure 3 were removed, it would
appear something like the box labelled A; if the covers of boxes 4, 5,
and 6 were removed, their contents would look something like box B, which
contains boxes 7 to 13.

FIGURE 3. Activity Organized File of Nested Boxes

Each box is labelled by the pattern representing the center of
gravity of the patterns contained within it. The actual size of the box
at any level is determined from the distribution of the documents in it
and their logical relationship. This distribution can be determined from
the past concentration of activity, from the value of information contained
in the documents, or from a uniform distribution over the space. However,
the number of levels (size of boxes) open at any time is dependent upon
the a priori distribution p(x) of probable activity. The boxes are so
designed that the integral of the probability p(x) of each box
(independent of its size) is equal for all boxes that are visible at a
given time. Then it is equally likely that the answer to any request is in
a given visible box independent of its size. For example, if mere entry
into the file were the removal of a box cover from the entire file, then
the visible box structure might be that of Figure 3. This structure
indicates that the probability of finding the answer to a request in box 2
is the same as the probability of finding the answer in box 3. The boxes
not visible in Figure 3 represent lower levels of activity in the file.
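The equal-probability design of the visible boxes can be illustrated with a discrete sketch. It assumes the file items are already arranged in a fixed order and that p(x) is given as one probability per item; the greedy splitting rule, the function name `equal_probability_boxes`, and the sample distribution are all assumptions made for illustration.

```python
def equal_probability_boxes(probs, n_boxes):
    """Split an ordered list of activity probabilities into contiguous
    boxes whose total probabilities are as nearly equal as a greedy
    left-to-right pass allows. Returns lists of item indices."""
    total = sum(probs)
    share = total / n_boxes
    boxes, current, cumulative = [], [], 0.0
    for index, p in enumerate(probs):
        current.append(index)
        cumulative += p
        # Close the box once the running total reaches the next equal
        # share; the last box is left open to take whatever remains.
        if len(boxes) < n_boxes - 1 and cumulative >= share * (len(boxes) + 1):
            boxes.append(current)
            current = []
    if current:
        boxes.append(current)
    return boxes


# Hypothetical activity distribution over eight file items.
activity = [0.25, 0.125, 0.125, 0.25, 0.0625, 0.0625, 0.0625, 0.0625]
boxes = equal_probability_boxes(activity, 4)
masses = [sum(activity[i] for i in box) for box in boxes]
```

In this example the four boxes hold one, two, one, and four items respectively, yet each carries the same probability, so a request is equally likely to be answered from any of them regardless of size.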

If a certain box is active, its contents are examined; these
contents consist of a set of boxes of equal probable activity. This
process is continued until a request is answered satisfactorily by a
pattern representing a box at some level, ultimately by a document. Given
a measure of the conditional probable activity, given present activity at
time t, the boxes are arranged in order according to this measure. The
determination of the actual relevance of documents to a situation and the
selection of an adequate response involves the matching of a request
against the available box patterns; that is, the successive box labels
are scanned and matched against the request pattern. The selected box is
then opened, and its contents are scanned for a match with the request
pattern. This process is continued until the request is answered
satisfactorily.
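The scan-and-open procedure can be sketched as a descent through nested boxes. This is a simplified illustration: labels are assumed to be points in a two-dimensional space, the match is taken to be the smallest Euclidean distance, and the search always runs down to a document rather than stopping at an intermediate box. The dictionary layout and all names are hypothetical.

```python
def distance(a, b):
    """Euclidean distance between two label patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def search(box, request, opened=None):
    """Open boxes level by level, each time choosing the box whose label
    best matches the request. Returns (document name, boxes opened)."""
    opened = [] if opened is None else opened
    opened.append(box["name"])
    children = box.get("children")
    if not children:
        return box["name"], opened
    best = min(children, key=lambda child: distance(child["label"], request))
    return search(best, request, opened)


# Hypothetical file: each box label is the center of gravity of its contents.
file_root = {
    "name": "file", "label": (2.0, 2.0), "children": [
        {"name": "A", "label": (1.0, 1.0), "children": [
            {"name": "doc1", "label": (0.0, 1.0)},
            {"name": "doc2", "label": (2.0, 1.0)},
        ]},
        {"name": "B", "label": (4.0, 4.0), "children": [
            {"name": "doc3", "label": (4.0, 4.0)},
        ]},
    ],
}
found, path = search(file_root, (1.8, 1.2))
```

The list of opened boxes corresponds to the number of box covers removed in the search.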

The set of patterns representing the labels of the boxes that are
visible at any time is equivalent to an index. The index is scanned, and
it indicates where in the space further attention should be directed. The
basis for organization by p(x) is that in scanning an index at a certain
level, some of the patterns are references to groups of patterns at the
next index level, but some are references to lower levels because of the
volume of usage of patterns there.

Mathematical expressions, which indicate the number of boxes at
each level of a file and the expected number of box covers removed in a
search, can be derived in terms of the number of levels of the file and
the number of parts in a single partition. The cost of a search is
directly related to box size and could be used in addition to relevance
as a criterion for selecting boxes whose contents are to be examined.

4.5.3.3 Physical Organization - There is a relation between the
logical file organization and the physical organization of a system. The
logical file organization can be represented by a tree structure where only
the terminal nodes are basic file items; the nodes on other levels
represent higher level abstractions. The cost of searching such a tree
beginning at the top is a function of the number of levels of the tree, the
number of nodes at each level, the number of branches that must be
searched, and the access time for each node.


The cost of such a search is a measure of efficiency of the
physical organization of the file. If cost is a monotonically increasing
function of time, then minimum cost and, therefore, maximum efficiency are
achieved in minimum search time. The average search time T can be
represented by:


$$
T = \sum_{i=1}^{L} \sum_{j=1}^{n_i} p(i,j) \left( t_{aij} + t_{sij} \right) \tag{4-25}
$$

where:

    $p(i,j)$  = probability of selecting the $j$th node, level $i$
    $t_{aij}$ = time to access the $j$th node, level $i$
    $t_{sij}$ = time for selection process, $j$th node, level $i$
    $n_i$     = number of nodes on level $i$
    $L$       = number of file levels
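The average search time is a probability-weighted sum over every node of the tree, which the following sketch evaluates directly. The data layout (one list per level, one tuple per node) and the sample values are assumptions made for illustration.

```python
def average_search_time(levels):
    """Average search time T for a tree.

    `levels` is a list of levels; each level is a list of nodes given as
    (p, t_access, t_select): the probability of selecting the node, the
    time to access it, and the time for the selection process.
    """
    return sum(
        p * (t_access + t_select)
        for level in levels
        for (p, t_access, t_select) in level
    )


# Hypothetical two-level file with three nodes.
levels = [
    [(0.5, 1.0, 1.0)],
    [(0.25, 2.0, 1.0), (0.25, 4.0, 1.0)],
]
T = average_search_time(levels)
```

Here T = 0.5(2) + 0.25(3) + 0.25(5) = 3.0 time units.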


There are two methods for reducing the average search time in
such a tree structure. If an estimate of the file activity is available,
the order in which the nodes are processed may be revised, allowing a
reduction in either or both the access and the process times. This
process reflects an activity organization. The second method is to move
terminal nodes to a higher level in the tree. Then searches can be
terminated without processing all levels of the file (tree).

For activity organization the minimum value of T is obtained
when the highest probability is associated with the lowest time (that is,
the sum of access and selection times), the next highest probability with
the next lowest time, and so on.
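The pairing rule for activity organization amounts to sorting probabilities in descending order and times in ascending order. The sketch below assumes the available positions are characterized only by their total time (access plus selection) and that any node may occupy any position; the function name and the sample values are hypothetical.

```python
def assign_positions(probabilities, times):
    """Pair the highest probability with the lowest time, the next highest
    with the next lowest, and so on.

    Returns (assignment, T), where assignment maps node index to the time
    of the position it receives and T is the resulting average time.
    """
    by_activity = sorted(range(len(probabilities)),
                         key=lambda i: -probabilities[i])
    slots = sorted(times)
    assignment = dict(zip(by_activity, slots))
    T = sum(probabilities[node] * t for node, t in assignment.items())
    return assignment, T


# Hypothetical activity estimates and position times.
probabilities = [0.25, 0.5, 0.25]
times = [3.0, 1.0, 2.0]
assignment, T = assign_positions(probabilities, times)
```

The most active node (index 1) receives the fastest position, and no other pairing of these nodes and positions yields a smaller T.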


For the hierarchical organization, T is minimum when the elements
with the highest probability are highest in the tree. Since this type of
organization changes the structure of the tree itself, minimum cost C does
not necessarily occur at minimum T. Consequently, the criterion for
moving the $j$th node from level $k$ to $k - 1$ is:

$$
C_{k-1} \, T(j, k-1) < C_k \, T(j, k) \tag{4-26}
$$

In the application of this criterion all nodes are first assigned to the
lowest level of the tree and a minimum CT obtained. Moving nodes to the
next higher level is then considered in order of their probability. When
the criterion is violated, no other moves on that level need be considered
because all the remaining nodes on that level have a probability less than
or equal to the node that violated the criterion. The nodes on the next
higher level are then considered. After the moves from one level to the
next are completed, the evaluation begins again at the lowest level of the
tree in order to ascertain whether these moves have adversely affected the
efficiency of earlier moves. The evaluation moves up the tree until the
first new level is processed; then it re-cycles. This procedure is
completed when the node with the highest probability violates the criterion
or when all levels of the file have been processed.

4.5.3.4 File Reorganization - The usage of information retrieval
systems changes with time. Consequently, the distributions upon which an
activity organized file is based change with time. On the basis of this
and improved knowledge of the value and proper position of documents in
the file, a need exists for a procedure that automatically changes the
grouping, accessibility, and scanning sequence of file items.

One approach to such a procedure is based upon a multi-level
activity organized file with certain logical associations. Suppose stored
patterns of bits of fixed word length are divided into three parts called
stimulus, response, and index. The stimulus and response sections of the
pattern consist of groups of pairs of bits. Each pair of bits corresponds
to a particular characteristic of interest. There are four possible values
or patterns for a bit pair. Three of these values correspond to values of
high, medium, and low for the given characteristic with respect to a
particular pattern. Values for some characteristics come from the
environment; others are determined from file operations. The bit pair of
any characteristic that must be determined by file operation is assigned
the fourth possible value, which will be interpreted as a question. The
only reason for distinguishing between stimulus and response sections of a
pattern is to indicate that the stimulus characteristics are generally
prescribed by the environment while the response characteristics are
provided by the file. However, this division is not based upon necessity
but only upon probability.
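The bit-pair representation can be sketched as follows. The report states only that three of the four two-bit values mean high, medium, and low and that the fourth marks a question; the particular code assignment below, the packing order, and the function names are assumptions made for illustration.

```python
# Hypothetical assignment of the four two-bit codes.
LOW, MEDIUM, HIGH, QUESTION = 0b00, 0b01, 0b10, 0b11


def pack(values):
    """Pack a sequence of two-bit characteristic values into one word,
    first characteristic in the most significant position."""
    word = 0
    for value in values:
        word = (word << 2) | value
    return word


def unpack(word, count):
    """Recover `count` two-bit characteristic values from a word."""
    return [(word >> (2 * (count - 1 - k))) & 0b11 for k in range(count)]


def questions(word, count):
    """Positions of the characteristics the file must supply."""
    return [k for k, v in enumerate(unpack(word, count)) if v == QUESTION]


# A partially prescribed pattern of five characteristics: three come from
# the environment and two are left as questions.
semi_pattern = pack([HIGH, LOW, QUESTION, MEDIUM, QUESTION])
```

Calling `questions(semi_pattern, 5)` identifies the third and fifth characteristics as the ones to be filled in by file operations.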

The operation of this file may be described with a simple two-level
file; the model can be extended without difficulty. The first level stores
a limited number of patterns; the second level has the capacity to store
an indefinite number of patterns; that is, it will be large enough to
handle all patterns not on the first level. In generating patterns the
environment prescribes values for certain characteristics and leaves
questions for the remainder where values must be supplied by file
operations. A partially prescribed pattern of this type is a semi-pattern.
The semi-pattern is then matched according to some rule of association
with the patterns stored in the first level of the file. This process
results in a relative ranking of these patterns in their order of
association with the semi-pattern. From the patterns that match the
semi-pattern to a degree greater than a specified minimum relevance, those
that are most relevant to the semi-pattern are selected. Values for the
question bits (characteristics) of the semi-pattern are provided by
relative weighting of the corresponding characteristics of the most
relevant stored pattern. If none of the stored patterns at the first level
have a relevance greater than the prescribed minimum, patterns must be
remembered from the next level.

Patterns created from the environmental semi-patterns and
file-created answers to questions are stored in the first level. Since the
storage capacity of the first level is fixed, it will eventually be
exceeded. Therefore, a process must be introduced for forgetting
patterns; that is, for transferring patterns to the second level. The
procedure is: A quantity is determined by a relative weighting of past
relevancy of each first-level pattern and the present relevancy of the
pattern to the semi-pattern. In terms of this quantity the least
relevant pattern or group of patterns is forgotten. Using the same rule
that was used for determining relevance to the environment, the first-level
pattern most relevant to the pattern to be forgotten defines the location
in which the pattern to be forgotten will be stored. This address is
determined from the indexing portion of the relevant pattern. The
indexing section of a pattern consists of three addresses: a starting
address, the next available address, and the last address assigned in the
second level of the file to the relevant pattern. The forgotten pattern is
stored in the next available address assigned to this relevant pattern,
and the next available address is updated.

When the patterns stored at the first level do not match a
semi-pattern to the specified minimum degree of relevance, patterns must be
remembered or recalled from the second level by means of the indices of
the most relevant patterns, even though they are below the acceptable
minimum. The index section of the most relevant pattern at the first
level thus provides a mechanism for obtaining a pattern from the second
level, bringing it to the first level, and examining it for relevancy.
This process is continued until sufficiently relevant patterns are found
or until no further index data is available. If neither of these
conditions occurs after a reasonable prescribed time, the process can be
stopped arbitrarily; alternatively, the process can be stopped whenever
a new semi-pattern is accepted.
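The forgetting and remembering cycle can be sketched with a deliberately simplified two-level file. Several assumptions are made that go beyond the report: relevance is the fraction of prescribed characteristics that agree, questions are represented by `None`, the three-address index is replaced by a Python list chained to a first-level pattern, past relevancy is not weighted in, and recalled patterns are examined without being moved to the first level. The class and method names are hypothetical.

```python
def relevance(semi, pattern):
    """Fraction of prescribed (non-None) characteristics that agree."""
    known = [(s, p) for s, p in zip(semi, pattern) if s is not None]
    if not known:
        return 0.0
    return sum(1 for s, p in known if s == p) / len(known)


class TwoLevelFile:
    def __init__(self, capacity, minimum):
        self.capacity = capacity  # fixed size of the first level
        self.minimum = minimum    # prescribed minimum relevance
        self.first = []           # first-level patterns
        self.second = {}          # second level, chained to first-level patterns

    def answer(self, semi):
        """Fill the questions of `semi`, recalling from level two if needed."""
        if not self.first:
            return None
        best = max(self.first, key=lambda p: relevance(semi, p))
        if relevance(semi, best) < self.minimum:
            # Remember: follow the index of the most relevant pattern.
            for recalled in self.second.get(best, []):
                if relevance(semi, recalled) >= self.minimum:
                    best = recalled
                    break
            else:
                return None
        return tuple(b if s is None else s for s, b in zip(semi, best))

    def store(self, pattern, semi):
        """Store a pattern on level one; forget the least relevant if full."""
        self.first.append(pattern)
        if len(self.first) > self.capacity:
            victim = min(self.first, key=lambda p: relevance(semi, p))
            self.first.remove(victim)
            anchor = max(self.first, key=lambda p: relevance(victim, p))
            self.second.setdefault(anchor, []).append(victim)


f = TwoLevelFile(capacity=2, minimum=0.6)
current = (2, 0, None)            # semi-pattern with one question
for pattern in [(2, 0, 1), (0, 2, 2), (2, 1, 0)]:
    f.store(pattern, current)
direct = f.answer((2, 0, None))   # answered from the first level
recalled = f.answer((0, 2, None)) # answered by recall from the second level
```

The third `store` call overflows the first level, so the pattern least relevant to the current semi-pattern is moved to the second level and later recovered through the chain of its anchor pattern.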

4.5.4 System Design and Synthesis - Detailed consideration of
system design and synthesis should be postponed until the other areas have
been developed to a greater extent. The other areas are not system
oriented, while this one is. It therefore constitutes the last phase in the
development of a theory of information retrieval. A convenient
subdivision of this phase is:

(a)  Organization  of  processes. 

(b) Organization of equipment and personnel.

(c)  Evaluation  of  system  efficiency. 

For  the  sake  of  completeness,  this  area  will  be  discussed  briefly. 



4.5.4.1 Organization of Processes - The organization of
processes is sometimes called the logical design of a system. The end
product is usually a set of flow charts. These charts would show the
sequence of functions to be performed, decisions and alternatives, points
of interrelation and feedback, and the inputs and outputs for each
function. There is as yet no adequate mathematical method for isolating
system functions and completing the logical design. The resultant flow
charts, however, do serve as a sort of schematic graphical model of the
system design.

4.5.4.2 Organization of Equipment and Personnel - The objective
of this area is to allocate tasks or assign functions to equipment and
personnel. Criteria for these allocations are the flexibility, speed,
and accuracy requirements of the various functions and subfunctions
comprising the system. To date the allocation of functions to men and
machines has been an art largely constrained by the rigidity of computer
techniques for associating, classifying, storing, and retrieving data.
In other words, all those functions that could not be automated with the
required degree of flexibility have been allocated to personnel.
Improvements in this function, therefore, will not depend upon
mathematicizing the process but upon developing better mathematical models
in the areas of file item representation, file organization, evaluation of
system efficiency, and related problem areas.

4.5.4.3 Evaluation of System Efficiency - Adequate criteria for
measuring the value of an information system have not yet been developed.
Therefore, models of system efficiency must be viewed as aids to design,
which may confirm intuitive judgments, but not as adequate tools in
themselves to make design decisions.


An information system is a collection of components that in
concert perform a set of operations to accomplish a specific purpose. The
system is represented by a matrix $A$ of efficiency values. The rows of
this matrix correspond to the individual components and the columns to the
individual operations. The element $e_{ij}$ is the efficiency with which
the $i$th component performs on the $j$th operation.
The component efficiencies could be defined by some parameter such as the
product of cost in dollars per unit of time and the operational time divided
by the number of bits processed; that is, the efficiency has units of
dollars per bit. A volume vector $v$ can be defined whose components are
the volumes or traffic loads for each operation. The product of the
efficiency matrix and volume vector is defined as the required cost
vector $C$ whose components are the costs required to perform the given
volume of the set of operations at the defined efficiencies. There are
practical problems in determining the various parameters, but these will
be ignored in illustrating the model. Using an rms measure of efficiency
E yields:

$$
E = \sqrt{\frac{v^{T} A^{T} A \, v}{v^{T} v}} \tag{4-27}
$$

where $A^{T}$ indicates the transpose of the matrix $A$. The quantity under
the radical on the right is the Rayleigh quotient for the matrix $A^{T} A$.
Efficiency can now be maximized by the methods of eigenvalue analysis. A
generalization of the classical eigenvalue theory is required to handle a
non-square matrix $A$ directly. This mathematical generalization is
available in Hestenes [13].

Summary - This section discussed some mathematical models and
their purposes as related to specific problem areas in information retrieval.
These models are related in the following coherent summary:


(1) Objective - Description of semantic relations

(2)  Data  Source  -  Vocabularies 

(3)  Model  -  Lattice 


(1) Objective - Measurement of information content

(2)  Data  Source  -  Document  abstract  size 

(3)  Model  -  Information  theory 


(1) Objective - Measurement of relevancy and categorization in
                terms of it

(2)  Data  Source  -  Document-term  matrix 

(3)  Model  -  Matrix  algebra 


(1)  Objective  -  Measurement  and  optimization  of  responsiveness 

(2)  Data  Source  -  Activity  distributions 

(3)  Model  -  Nested  box  structures 


Organization of Files


(1) Objective - Optimization of physical organization of files

(2) Data Source - Facility costs and operating rates, personnel
                  costs and operating rates, sequence of
                  operations, and activity distributions

(3) Model - Average cost of a search


(1) Objective - Definition of programmable processes for file
                reorganization

(2) Data Source - Statistics of environment

(3) Model - Multi-level index-connected file


(1) Objective - Measurement and optimization of system efficiency

(2) Data Source - Component-operation performance analysis
                  resulting in the component-operation efficiency
                  matrix

(3) Model - Matrix algebra



It should be emphasized that these models are not necessarily the best
nor the only models that can be developed to solve any particular problem.

5. CONCLUSIONS

Four aspects of the research orientation were described in establishing
the frame of reference for this project: system-procedure, real-hypothetical,
hardware-software, reduction-manipulation. A theoretical (procedural,
hypothetical, software, manipulative) approach has been adopted. A
preliminary generalized model has been formulated as a basis for analyzing
detailed aspects of the problem. Several procedural areas have been
analyzed in varying degrees of formalization. The interrelationships among
the functional characteristics of the preliminary model as well as their
relation to the entire problem are being investigated. There remains an
extensive task of formalizing these areas into an integrated whole in order
to fulfill the objectives of the program.

6. PLANS FOR THE NEXT QUARTER

Activities during the next quarter will proceed with the over-all
goal of developing a theory of information retrieval for use as a tool
in the design of information retrieval systems. Work will include at
least the following three aspects of the development of such a theory.

(a) A statement of the necessary or desirable features of a theory
of information retrieval together with a breakdown of the
essential functional elements of information retrieval and
their interrelationships.

(b) Continue development of an information retrieval model based
on Item (a) and the preliminary model. This work will include
utilizing and relating results of Item (c).

(c)  Continue  work  on  functional  elements  of  the  model  and  techniques 
that  are  applicable  to  the  effective  performance  of  these  essen¬ 
tial  functions  (e.g.,  measures  of  relevance  as  applied  to 
descriptor  assignment). 

These three aspects of the work are actually levels of detail. The
first provides a general statement of the objectives of the research,
defines essential areas of effort, and provides guidelines and
definitions for use in the development of the theory. The second level of
effort develops and defines the essential features of the theory to the
point where a representative model is meaningful. It will isolate
independent functions and establish relations between functions that are
not independent. The third level develops detailed techniques, procedures,
and methodology useful for the design of an effective information retrieval
system.

During the next quarter each aspect of activity will also be oriented
to the definition, development, and exposition of specific tasks within
this general methodological framework.


7. IDENTIFICATION OF PERSONNEL

7.1 PERSONNEL ASSIGNMENTS

The following personnel were assigned to the project during the
period covered by this report:


Name                     Title                      Man-Hours

Jacques Harlow           Manager                        60
Quentin A. Darmstadt     Research Specialist           260
George Greenberg         Senior Specialist             300
Alfred Trachtenberg      Senior Program Analyst        425

The man-hours applied to the project during this period deviated slightly
from the schedule because of conferences, holidays, and vacations, all of
which were heavily concentrated during this reporting period.

7.2  BACKGROUND OF PERSONNEL

The backgrounds of the personnel assigned to the project were
described in the First Quarterly Report. No new personnel were assigned
to the project.


8.  APPENDICES


8.1  APPENDIX A - Maxima and Minima of the Measures

The behavior of the measures of goodness and the various entropy
functions will now be examined. Maxima and minima in terms of the
$p_j$ and $p_{ij}$ are summarized in Tables 1 and 2.


For these tables, it is assumed that A is chosen such that $A = 1/p_e$,
where $p_e$ is the smallest $p_j$; that is, $p_e \le p_j$ for all j. For the
functions of Table 1 ($H$, $H_a$, $H_i$, and $S_i$) the pertinent values are
the maximum and minimum values in terms of a given $p_e$ and the absolute
maximum and minimum values of each function.


For $H$ and $H_a$, maxima are reached when the probabilities are equal or,
for a particular $p_e$, when the other $p_j$ are equal; minima are reached
when one probability becomes a maximum and the rest are minima.


While $H_a$ does not reach an absolute maximum when $H$ does, since it was
assumed that $A = 1/p_e$, it does reach a maximum together with $H$ for a
particular $p_e$. Then:

$$
\begin{aligned}
H_a &= -\sum_j p_j \log p_j + \log A = -\sum_j p_j \log p_j - \log p_e \\
    &= -\sum_{j \ne e} p_j \log p_j - (1 + p_e) \log p_e
\end{aligned}
\qquad \text{(8-1)}
$$


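Therefore, $H_a$ becomes a maximum for a particular $p_e$ when
$p_j = \frac{1 - p_e}{k - 1}$ for $j \ne e$, k being the number of $p_j$. Then:

$$H_{a\,max} = -(1 - p_e) \log \frac{1 - p_e}{k - 1} - (1 + p_e) \log p_e \qquad \text{(8-2)}$$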
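The maximization behind Equation (8-2) can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, natural logarithms are assumed, and k is taken to be the number of probabilities $p_j$ (the same k that appears in Equation (8-4)). It compares $H_a$ computed directly from a distribution with the closed form obtained when the remaining probabilities are equal.

```python
import math

def entropy(p):
    """A priori entropy H = -sum p_j log p_j (natural log; zero terms skipped)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def h_a(p):
    """H_a = H + log A, assuming A = 1/p_e with p_e the smallest probability."""
    return entropy(p) - math.log(min(p))

def h_a_max(p_e, k):
    """Closed form for H_a when the other k-1 probabilities equal (1-p_e)/(k-1)."""
    return -(1 - p_e) * math.log((1 - p_e) / (k - 1)) - (1 + p_e) * math.log(p_e)

k, p_e = 4, 0.1
equal_rest = [p_e] + [(1 - p_e) / (k - 1)] * (k - 1)  # remaining mass spread evenly
skewed = [p_e, 0.2, 0.3, 0.4]                         # same p_e, unequal remainder

print(h_a(equal_rest), h_a_max(p_e, k), h_a(skewed))
```

For the same $p_e$, the evenly spread distribution should give the larger $H_a$, and its value should agree with the closed form.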


The largest $H_{a\,max}$ occurs when $p_e = 1/N$. Then:

$$H_{a\,absmax} = \left(1 + \frac{1}{N}\right) \log N - \left(1 - \frac{1}{N}\right) \log \frac{N - 1}{N (k - 1)} \qquad \text{(8-3)}$$

TABLE 1 (Continued). Maxima and Minima of Entropy Functions

$H_a$ becomes a minimum for a particular $p_e$ when $H$ does; that is, when the
maximum $p_j$, $p_m = 1 - (k - 1) p_e$, and $p_j = p_e$ for $j \ne m$, where
$p_e \le p_j$ for all j. Then:

$$H_{a\,min} = -[1 - (k - 1) p_e] \log [1 - (k - 1) p_e] - [1 + (k - 1) p_e] \log p_e \qquad \text{(8-4)}$$

The smallest $H_{a\,min}$ occurs when $p_e = 1/k$. Then:

$$H_{a\,absmin} = 2 \log k \qquad \text{(8-5)}$$
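As a hedged numerical illustration of Equations (8-4) and (8-5), the Python sketch below evaluates the closed form for the minimum of $H_a$ at several values of $p_e$ and compares one of them with a direct computation. The function names are hypothetical and natural logarithms are assumed; k = 4 is an arbitrary choice.

```python
import math

def h_a(p):
    """H_a = -sum p_j log p_j + log A, assuming A = 1/p_e and p_e = min(p)."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h - math.log(min(p))

def h_a_min(p_e, k):
    """Closed form of the minimum: one probability is 1-(k-1)p_e, the rest are p_e."""
    p_big = 1 - (k - 1) * p_e
    return -p_big * math.log(p_big) - (1 + (k - 1) * p_e) * math.log(p_e)

k = 4
# The minimum shrinks as p_e grows toward 1/k, where it reaches 2 log k.
values = [h_a_min(p_e, k) for p_e in (0.05, 0.10, 0.20, 0.25)]
direct = h_a([0.7, 0.1, 0.1, 0.1])  # direct evaluation for p_e = 0.1

print(values, direct, 2 * math.log(k))
```

The last entry of `values` corresponds to $p_e = 1/k$, where all probabilities are equal.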

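$S_i$ becomes a maximum when $p_{ij} = p_j$ for all j. This maximum can be
derived by using Gibbs' theorem, as in Watanabe:

$$S_{i\,max} = \log A = -\log p_e \qquad \text{(8-6)}$$

The largest $S_{i\,max}$ occurs when $p_e = 1/N$:

$$S_{i\,absmax} = \log N \qquad \text{(8-7)}$$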


$S_i$ becomes a minimum when $p_{ij}$ becomes one for the particular j for
which $p_j$ is smallest. Then:

$$S_{i\,min} = \log (A p_e) \qquad \text{(8-8)}$$

but

$$A = 1/p_e \qquad \text{(8-9)}$$

so

$$S_{i\,min} = 0 \qquad \text{(8-10)}$$
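The two extremes of $S_i$ can be illustrated with a short Python sketch. It assumes the form $S_i = -\sum_j p_{ij} \log [p_{ij} / (A p_j)]$ with $A = 1/p_e$, which is the form consistent with Equation (8-15); the function name and the example prior are hypothetical, and natural logarithms are used.

```python
import math

def s_i(posterior, prior):
    """Assumed form: S_i = -sum_j p_ij log(p_ij / (A p_j)), with A = 1/min(prior).

    Terms with p_ij = 0 are skipped (they contribute zero in the limit).
    """
    a = 1.0 / min(prior)
    return -sum(q * math.log(q / (a * p))
                for q, p in zip(posterior, prior) if q > 0)

prior = [0.1, 0.2, 0.3, 0.4]                 # p_e = 0.1, so log A = log 10
s_max = s_i(prior, prior)                    # p_ij = p_j for all j
s_min = s_i([1.0, 0.0, 0.0, 0.0], prior)     # all weight on the smallest p_j
s_mid = s_i([0.25, 0.25, 0.25, 0.25], prior) # an intermediate case

print(s_min, s_mid, s_max, math.log(10))
```

Under these assumptions the intermediate case should fall between zero and $\log A$.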


For the functions of Table 2 ($M_1$, $M_2$, $M_3$, and $M_4$) there are three
maximum and minimum values: the maxima and minima for a given $p_j$
distribution; the maxima and minima when only $p_e$ is given; and the
absolute maxima and minima. To keep the notation consistent with that
of Table 1, these maxima and minima will be indicated as follows:
$M_{1\,maxj}$, $M_{2\,maxj}$, and so on are the maxima for a given $p_j$
distribution. Similarly, $M_{1\,minj}$, $M_{2\,minj}$, and so on are the minima
for a given $p_j$ distribution. $M_{1\,max}$, $M_{2\,max}$, $M_{1\,min}$, $M_{2\,min}$,
and so on are the maxima and minima when only $p_e$ is given, and
$M_{1\,absmax}$, $M_{2\,absmax}$, $M_{1\,absmin}$, $M_{2\,absmin}$, and so on are the
absolute maxima and minima.

TABLE 2 (Continued). Maxima and Minima of Measures of Goodness


$M_1 = H - H_i$ is maximized for a particular $p_j$ distribution when $H_i$ is
a minimum ($H_{i\,min} = 0$). Then $M_{1\,maxj}$ is simply the a priori entropy $H$.
$M_{1\,max}$, which is maximized for a particular $p_e$, is simply the a priori
entropy maximized; $M_{1\,absmax}$ is the absolute maximum of the a priori
entropy.

Similarly the minima of $M_1$ are obtained when $H_i$ is set equal to
$H_{i\,max}$ ($H_{i\,max}$ = [illegible]) and minimizing the a priori entropy.

$M_2 = H - S_i$ is maximized when $S_i$ is a minimum ($S_{i\,min} = 0$); the maxima
are simply the maxima of the a priori entropy. $M_2$ is minimized when
$S_i = S_{i\,max} = -\log p_e$; then $M_{2\,min} = H_{min} - S_{i\,max} = H_{min} + \log p_e$.
In addition, $M_{2\,absmin}$ [expression illegible].

$M_3 = H_a - S_i$ is maximized when $S_i = S_{i\,min}$; the maxima are then $H_a$,
$H_{a\,max}$, and $H_{a\,absmax}$, respectively. The minima of $M_3$ are not as
obvious, for the conditions of maximizing $S_i$ and minimizing $H_a$ can be
contradictory. It is best to analyze the minima of $M_3$ as follows:

$$
\begin{aligned}
M_3 = H_a - S_i &= -\sum_j p_j \log p_j + \log A + \sum_j p_{ij} \log \frac{p_{ij}}{A p_j} \\
                &= -\sum_j p_j \log p_j + \sum_j p_{ij} \log \frac{p_{ij}}{p_j}
\end{aligned}
\qquad \text{(8-11)}
$$


For a particular $p_j$ distribution, $M_{3\,minj}$ occurs when $p_{ij} = p_j$ for
all j. Therefore:

$$M_{3\,minj} = -\sum_j p_j \log p_j = H \qquad \text{(8-12)}$$

Then for a particular $p_e$:

$$M_{3\,min} = H_{min} \qquad \text{(8-13)}$$

and the absolute minimum is simply:

$$M_{3\,absmin} = H_{absmin} \qquad \text{(8-14)}$$
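The statement that $M_3$ reduces to the a priori entropy when $p_{ij} = p_j$ can be sketched in Python as follows. The sketch assumes the second form of Equation (8-11), $M_3 = -\sum_j p_j \log p_j + \sum_j p_{ij} \log (p_{ij}/p_j)$; the function names and sample distributions are hypothetical, and natural logarithms are used.

```python
import math

def entropy(p):
    """A priori entropy H = -sum p_j log p_j (zero terms skipped)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def m3(posterior, prior):
    """Assumed form of M_3: a priori entropy plus sum_j p_ij log(p_ij / p_j)."""
    gain = sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)
    return entropy(prior) + gain

prior = [0.1, 0.2, 0.3, 0.4]
at_prior = m3(prior, prior)                   # p_ij = p_j: the added term vanishes
elsewhere = m3([0.4, 0.3, 0.2, 0.1], prior)   # any other p_ij adds a positive term

print(at_prior, entropy(prior), elsewhere)
```

Because the added term is never negative, any departure of the $p_{ij}$ from the $p_j$ can only raise $M_3$ above $H$.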


$M_4$ is the simplest measure of them all, reaching a maximum when $S_i$ is
minimum, and a minimum when $S_i$ is maximum.

$$M_4 = \log A - S_i = \sum_j p_{ij} \log \frac{p_{ij}}{p_j} \qquad \text{(8-15)}$$


That this measure is always greater than or equal to zero can be shown
by applying Gibbs' theorem:

$$M_4 = \sum_j p_{ij} \log p_{ij} - \sum_j p_{ij} \log p_j \qquad \text{(8-16)}$$

But:

$$\sum_j p_{ij} \log p_{ij} - \sum_j p_{ij} \log p_j \ge 0 \quad \text{(Gibbs' theorem)} \qquad \text{(8-17)}$$

Therefore,

$$M_4 \ge 0 \qquad \text{(8-18)}$$

The maximum of $M_4$ is:

$$M_{4\,maxj} = M_{4\,max} = \log A \qquad \text{(8-19)}$$

The absolute maximum occurs when $p_e = 1/N$; then $A = N$ and

$$M_{4\,absmax} = \log N \qquad \text{(8-20)}$$
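The bounds on $M_4$ in Equations (8-18) and (8-19) can be spot-checked by sampling. The Python sketch below is an illustration, not a proof: it assumes $M_4 = \sum_j p_{ij} \log (p_{ij}/p_j)$ as in Equation (8-15), draws random $p_{ij}$ distributions with a hypothetical helper, and uses natural logarithms and an arbitrary example prior.

```python
import math
import random

def m4(posterior, prior):
    """M_4 = sum_j p_ij log(p_ij / p_j); terms with p_ij = 0 contribute zero."""
    return sum(q * math.log(q / p) for q, p in zip(posterior, prior) if q > 0)

def random_distribution(k, rng):
    """Hypothetical helper: k positive weights normalized to sum to one."""
    w = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)                      # fixed seed for repeatability
prior = [0.1, 0.2, 0.3, 0.4]                # p_e = 0.1
log_a = -math.log(min(prior))               # log A with A = 1/p_e

samples = [m4(random_distribution(4, rng), prior) for _ in range(1000)]
peak = m4([1.0, 0.0, 0.0, 0.0], prior)      # all weight on the smallest p_j

print(min(samples), max(samples), peak, log_a)
```

Every sampled value should lie between zero and $\log A$, with $\log A$ attained when $p_{ij}$ is one for the j with the smallest $p_j$.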